内容简介:Databases are commonly used as dumb storage bins forThe standard way of interfacing with Postgres is SQL – you know, that thing you may have briefly learned in webdev class where youIn this post, I’ll present a surprisingly simple recursive CTEthat impleme
Databases are commonly used as dumb storage bins for CRUD application data. However, this doesn’t do them justice: database systems such as PostgreSQL are much more capable and versatile than that, supporting a wide range of operations across many different data types .
The standard way of interfacing with Postgres is SQL – you know, that thing you may have briefly learned in webdev class where you SELECT
stuff FROM
a set of tables WHERE
some condition is met. But that’s a far cry from what you can achieve when taking advantage of the full feature set of this declarative and – surprise! – Turing complete programming language.
In this post, I’ll present a surprisingly simple recursive CTEthat implements an iterated function system for generating a variant of the Sierpiński triangle , along with a basic SVG code generation routine for visualization purposes. The end result will look like this:
This post is loosely based on what I learned in the Advanced SQL lecture held by Torsten Grust in the summer of 2017 at the University of Tübingen. Take a look at the lecture slides for in-depth explanations and a wide array of examples.
Theory
The Sierpiński triangle , which you’ve already seen above, is a self-similar fractal that takes the shape of an equailateral triangle that’s divided into four “subtriangles”: the middle one is blank, the rest are again divided in the same manner, ad infinitum . It can be constructed in a variety of ways , allof which are interesting – but we’ll focus on just one of them in this post.
Iterated function systems , meanwhile, are commonly utilized in the generation of self-similar fractals like the one we’ll be drawing today. Wikipedia defines them as follows:
Formally, an iterated function system is a finite set of contraction mappings on a complete metric space.
There you go.
Just kidding – that sentence makes no sense to me because I’m bad at math, and I hope you won’t be offended if I proceed under the assumption that you’re not much better at it. Informally , an iterated function system (IFS) is a set of functions that can be applied to each other’s outputs in any order and as many times as you like , with the results ending up distributed in a hopefully-interesting manner.
Wikipedia goes on to outline the particular application of IFS we’ll be performing:
The most common algorithm to compute IFS fractals is called the “ chaos game ”. It consists of picking a random point in the plane, then iteratively applying one of the functions chosen at random from the function system to transform the point to get a next point.
The functions, in this case, simply move the point halfway to one of three anchor points
arranged in a triangular configuration. Assuming our canvas is a unit square(with the origin located at the top left because that’s where it usually is in computer graphics):
You’ll notice that these points don’t quite form an equilateral triangle, but that doesn’t matter – we’ll simply get a slightly stretched Sierpiński triangle as a result. (If we moved
off-center, we’d get a skewed triangle instead.)
Let’s define our function system. It comprises three functions, each corresponding to one of the anchor points. Every function moves its input point halfway to its assigned anchor point:
Now! If we start out with
, feed it into one of the three functions at random, feed the result back into another random function, and keep doing that foriterations, the Sierpiński triangle slowly emerges from the initial noise.
You can observe this in the followingimage – each subplot, from top left to bottom right, advancesthe process by 10 iterations.
Recursive CTEs
Let’s get to know the SQL feature that will be instrumental in implementing our IFS!
Common table expressions (CTEs) are incredibly useful for structuring complex queries, allowing the query writer to define temporary VIEW
-like constructs. They help with breaking up a large query into potentially reusable components: The WITH
keyword is used to chain queries together, while assigning a name to each component query by which its results can be referenced in all following queries within the same WITH
block.
A WITH
block is terminated by a standard query which may reference any of the named components – frequently, this is just a TABLE
statement.
As an example, this basicCTE computes the sum of squared residuals between a predicted result and a sequenceof observations:
WITH predicted(id, n) AS ( SELECT s.n, exp(s.n) FROM generate_series(0, 4) AS s(n) ), observed(id, n) AS ( VALUES (0, 2), (1, 3), (2, 7), (3, 19), (4, 53) ), residuals(id, n) AS ( SELECT p.id, o.n - p.n FROM predicted p JOIN observed o ON p.id = o.id ), squares(id, n) AS ( SELECT id, n ^ 2 FROM residuals ), sum(n) AS ( SELECT sum(n) FROM squares ) TABLE sum;
Recursive CTEs , recognizable by the WITH RECURSIVE
keywords, have another ace up their sleeve: They can run a self-referential query multiple times until it produces no further rows. Somewhat unintuitively, this is an iterative process.
Let’s consider an example which computes the factorial of the number 7:
WITH RECURSIVE fac(n, res) AS ( SELECT 7, 1 UNION ALL SELECT n - 1, res * n FROM fac WHERE n > 1 ) SELECT res FROM fac WHERE n = 1;
A recursive CTE must be composed of three parts , each with their own semantics:
-
A non-recursive term – here,
SELECT 7, 1
.This query is evaluated exactly once at the beginning of the iterative process. Its results are inserted into both a temporary working table and, after the second part ↓ is considered, into the output table . This constitutes the first iteration of the CTE.
-
An operator that steers how results are combined after each iteration: either
UNION ALL
orUNION
.In the latter case, rows of the working table that are duplicates with regard to both itself and the output table are eliminated in-between iterations and before the contents of the working table are appended to the output table .
-
The recursive term , which may reference the current CTE (here:
fac
) in itsFROM
list or in subqueries.Such references are resolved to the working table , which always only contains rows created in the previous iteration: its old contents are flushed to the output table after each iteration and replaced with the current iteration’s result.
If and only if the working table is empty at the beginning of an iteration, the process terminates.
In Postgres, the recursive term of a CTE merely supports a subset of the usual SQL querying toolbox. Most notably, ORDER BY
cannotbe used – but it’s allowed within subqueries, which makes this particular limitation a non-issue in practice.
Implementation
Hoping that you’ve grokked the general concepts of iterated function systems and recursive CTEs, let’s begin implementing our Sierpiński triangle generator in SQL.
First, we’ll need to define a tableto house the three anchor points
– nothing fancy, we don’t even need anid
column:
CREATE TABLE anchors ( x float, y float ); INSERT INTO anchors VALUES (0.5, 0), (0, 1), (1, 1);
Before writing the query, we’ll define how many iterations of our function system we want to compute. psql
’s \set
functionality comes in handy here (note that 1e4
is just a shorthand for
).
\set n 1e4
Let’s now get to writing the recursive query that will play the chaos game for us!
I’ll reveal it in stages in roughly the same order I formulated the query in, beginning with the schema of our RECURSIVE
ly built-up table: We need to store
id
column which is going to be incremented in each iteration until it eclipses
:n
, which is our termination condition.
WITH RECURSIVE points(id, x, y) AS ( ... ) ...
Now, the non-recursive term must select a starting value of our point
– we’re just going to pick the center of our unit square for good measure.
WITH RECURSIVE points(id, x, y) AS ( SELECT 1, 0.5::float, 0.5::float ... ... ) ...
Alright! Since we’ll be incrementing the id
column in each iteration, we’ll never create any duplicates. Hence:
WITH RECURSIVE points(id, x, y) AS ( SELECT 1, 0.5::float, 0.5::float UNION ALL ... ) ...
Speaking of incrementing the id
column in each iteration – let’s do that while also implementing our termination condition at the same time. Once we run the query, :n
is going to be substituted with the value of the psql
variable we’ve set up earlier. Remember that only the row added in the previous iteration is visible to the recursive term, otherwise we’d get exponential instead of linear growth.
WITH RECURSIVE points(id, x, y) AS ( SELECT 1, 0.5::float, 0.5::float UNION ALL SELECT p.id + 1, ..., ... FROM points p, ... WHERE p.id < :n ) ...
The next step is the selection of a random anchor point (i.e. implicitly the random selection among
) from theanchors
table. You might be inclined to formulate this as follows…
WITH RECURSIVE points(id, x, y) AS ( SELECT 1, 0.5::float, 0.5::float UNION ALL SELECT p.id + 1, ..., ... FROM points p, anchors a WHERE p.id < :n ORDER BY random() -- ⚠ this won't work LIMIT 1 ) ...
…but that won’t work because, as previously discussed, Postgres doesn’t support ORDER BY
in the outermost query of the recursive term. Back to the drawing board – let’s try a subquery:
WITH RECURSIVE points(id, x, y) AS ( SELECT 1, 0.5::float, 0.5::float UNION ALL SELECT p.id + 1, ..., ... FROM points p, (SELECT * FROM anchors ORDER BY random() LIMIT 1) AS a(x, y) WHERE p.id < :n ) ...
That’s perhaps a bit less elegant, but it’ll work just fine.
Finally, all that’s left to do is implementing the computation of the updated position of our point
– we can basically just fill in the formula for from above. We’ll also utilizeTABLE points
at the very end to output the results.
WITH RECURSIVE points(id, x, y) AS ( SELECT 1, 0.5::float, 0.5::float UNION ALL SELECT p.id + 1, (p.x + a.x) / 2, (p.y + a.y) / 2 FROM points p, (SELECT * FROM anchors ORDER BY random() LIMIT 1) AS a(x, y) WHERE p.id < :n ) TABLE points;
That’s it! Our iterative function system iterates as expected, which we can verify by taking a look at the query result:
+-------+----------------------+---------------------+ | id | x | y | +-------+----------------------+---------------------+ | 1 | 0.5 | 0.5 | | 2 | 0.5 | 0.25 | | 3 | 0.25 | 0.625 | | 4 | 0.625 | 0.8125 | | 5 | 0.3125 | 0.90625 | | 6 | 0.40625 | 0.453125 | | 7 | 0.453125 | 0.2265625 | | 8 | 0.4765625 | 0.11328125 | | 9 | 0.23828125 | 0.556640625 | | 10 | 0.619140625 | 0.7783203125 | | 11 | 0.8095703125 | 0.88916015625 | | 12 | 0.90478515625 | 0.944580078125 | | 13 | 0.952392578125 | 0.9722900390625 | | 14 | 0.7261962890625 | 0.48614501953125 | | 15 | 0.86309814453125 | 0.743072509765625 | | 16 | 0.681549072265625 | 0.371536254882812 | | 17 | 0.840774536132812 | 0.685768127441406 | | 18 | 0.420387268066406 | 0.842884063720703 | | 19 | 0.710193634033203 | 0.921442031860352 | | 20 | 0.605096817016602 | 0.460721015930176 | | 21 | 0.802548408508301 | 0.730360507965088 | | 22 | 0.40127420425415 | 0.865180253982544 | | 23 | 0.450637102127075 | 0.432590126991272 | | 24 | 0.475318551063538 | 0.216295063495636 | | 25 | 0.487659275531769 | 0.108147531747818 | | 26 | 0.493829637765884 | 0.054073765873909 | | 27 | 0.246914818882942 | 0.527036882936954 | | 28 | 0.123457409441471 | 0.763518441468477 | | 29 | 0.311728704720736 | 0.381759220734239 | | 30 | 0.155864352360368 | 0.690879610367119 | ┆ ┆ ┆ ┆ | 9990 | 0.363056443960706 | 0.85909628057552 | | 9991 | 0.431528221980353 | 0.42954814028776 | | 9992 | 0.465764110990177 | 0.21477407014388 | | 9993 | 0.482882055495088 | 0.10738703507194 | | 9994 | 0.241441027747544 | 0.55369351753597 | | 9995 | 0.370720513873772 | 0.276846758767985 | | 9996 | 0.435360256936886 | 0.138423379383993 | | 9997 | 0.217680128468443 | 0.569211689691996 | | 9998 | 0.358840064234222 | 0.284605844845998 | | 9999 | 0.179420032117111 | 0.642302922422999 | | 10000 | 0.589710016058555 | 0.8211514612115 | +-------+----------------------+---------------------+ (10000 rows)
This looks like it might be right, but how can we tell? I, for what it’s worth, can’t picture this sequence of points in my mind, so let’s visualize it!
There’s a plethora of toolswe could feed this table into – but I think it’s more fun to stay within the confines of SQL for now by generating a basic SVG image, which every web browser can display.
Visualization
SVG, in case you’re not familiar with it, is …
…an Extensible Markup Language (XML)-based vector image format for two-dimensional graphics with support for interactivity and animation.
Interactivity and animation are cool, but for our purposes, only twoof SVG’s various tags are relevant:
-
The
<svg>
tag serves as the root element of the document, quite similar to HTML’s<html>
tag. ItsviewBox
attribute demarcates the visible portion of the internal coordinate system – its value can be interpreted asleft_edge bottom_edge width height
.<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 1000 1000"> <!-- document content goes here --> </svg>
-
The
<circle>
tag draws a circle, whose radius must be defined via ther
attribute, centered on the coordinate specified by thecx
andcy
attributes. We’ll use this tag to draw both our anchor points and the point generated in each iteration. Its appearance can be defined with CSS located in thestyle
attribute (if not overridden in this manner, appearance is inherited from the<svg>
element’sstyle
attribute).In the following example, a solid black circle is drawn.
<circle cx="42" cy="1337" r="1" />
Armed with this knowledge, we can use SQL’s string concatenation operator ||
and a couple of subqueries to generate our image:
\set width 1000 \set height 1000 WITH RECURSIVE points(id, x, y) AS ( ... ), svg(text) AS ( SELECT '<svg xmlns="http://www.w3.org/2000/svg" viewBox="-3 -3 ' || :width + 6 || ' ' || :height + 6 || '">' || (SELECT string_agg('<circle cx="' || x * :width || '" cy="' || y * :height || '" r="1" />', '') FROM points) || (SELECT string_agg('<circle cx="' || x * :width || '" cy="' || y * :height || '" r="3" />', '') FROM anchors) || '</svg>' ) TABLE svg;
But if you execute this query, Postgres will present you with this:
+---------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------- -----------------------------------------------------------------------------------------------------------------------------------------
And so on – the giant string we’ve assembled is printed in the usual tabular output format, with the table’s toprule taking up multiple screens of the terminal window. We could scroll way down, find the generated SVG code, copy-paste it into a text editor, save the buffer as an .svg
, and open that file. But that’s not very satisfying, so instead, let’s invoke psql
with the --quiet
flag and pipe the query result into a file:
psql --quiet -f sierpinsky.sql > sierpinsky.svg
That’s actually not quite enough. We need to execute a few more configuration commands…
\pset border 0 \pset footer off \pset pager off \pset tuples_only on \timing off
…to disable various aspects of the default output format. But once that’s done, we can lean back and bask in the glory of what we’ve achieved with just a few lines of SQL:
And in case you haven’t been coding along, the finished sierpinsky.sql
looks just about like this:
\set n 1e4 \set width 1000 \set height 1000 -- shut up, w̶e̶s̶l̶e̶y̶postgres \pset border 0 \pset footer off \pset pager off \pset tuples_only on \timing off -- set up anchor points DROP TABLE IF EXISTS anchors; CREATE TABLE anchors ( x float, y float ); INSERT INTO anchors VALUES (0.5, 0), (0, 1), (1, 1); --TRUNCATE anchors; --INSERT INTO anchors VALUES -- (0.5, 0), -- (0, 0.4), -- (1, 0.4), -- (0.2, 1), -- (0.8, 1); -- let's-a go! WITH RECURSIVE points(id, x, y) AS ( SELECT 1, 0.5::float, 0.5::float UNION ALL SELECT p.id + 1, (p.x + a.x) / 2, (p.y + a.y) / 2 FROM points p, (SELECT * FROM anchors ORDER BY random() LIMIT 1) AS a(x, y) WHERE p.id < :n ), svg(text) AS ( SELECT '<svg xmlns="http://www.w3.org/2000/svg" viewBox="-3 -3 ' || :width + 6 || ' ' || :height + 6 || '">' || (SELECT string_agg('<circle cx="' || x * :width || '" cy="' || y * :height || '" r="1" />', '') FROM points) || (SELECT string_agg('<circle cx="' || x * :width || '" cy="' || y * :height || '" r="3" />', '') FROM anchors) || '</svg>' ) TABLE svg;
Addendum: Evolution visualization
In order to generate the “evolution” visualization shown at the bottom of the “Theory” section, I’ve adapted and parameterized the query detailed in this post to output a small multiples chart . In case you’re interested, the following code snippet replaces everything below the -- let's-a go!
comment:
\set width 100 \set height 100 \set rows 8 \set cols 8 \set fac 10 \set point_r 1 \set anchor_r 3 WITH RECURSIVE points(row, col, id, x, y) AS ( SELECT row, col, 1, 0.5::float, 0.5::float FROM generate_series(0, (SELECT :rows - 1)) AS row(row), generate_series(0, (SELECT :cols - 1)) AS col(col) UNION ALL SELECT p.row, p.col, p.id + 1, (p.x + a.x) / 2, (p.y + a.y) / 2 FROM points p, (SELECT * FROM anchors ORDER BY random() LIMIT 1) AS a(x, y) WHERE p.id < :fac * (p.row * :cols + p.col) --WHERE p.id < p.col * (:fac ^ p.row) ), svg(text) AS ( SELECT '<svg xmlns="http://www.w3.org/2000/svg" viewBox="-' || :anchor_r || ' -' || :anchor_r || ' ' || :width * (:cols + 1) + 2 * :anchor_r || ' ' || :height * (:rows + 1) + 2 * :anchor_r || '">' || (SELECT string_agg('' || (SELECT string_agg('<circle cx="' || (col + x + col / (:cols - 1.0)) * :width || '" cy="' || (row + y + row / (:rows - 1.0)) * :height || '" r="' || :point_r || '" />', '') FROM points WHERE row = p.row AND col = p.col) || (SELECT string_agg('<circle cx="' || (col + x + col / (:cols - 1.0)) * :width || '" cy="' || (row + y + row / (:rows - 1.0)) * :height || '" r="' || :anchor_r || '" />', '') FROM anchors WHERE row = p.row AND col = p.col) , '') FROM points p WHERE p.id = 1) || '</svg>' ) TABLE svg;
以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持 码农网
猜你喜欢:本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们。
大数据时代小数据分析
屈泽中 / 电子工业出版社 / 2015-7-1 / 69.00元
《大数据时代小数据分析》是一本大数据时代下进行小数据分析的入门级教材,通过数据分析的知识点,将各类分析工具进行串联和对比,例如:在进行线性规划的时候可以选择使用Excel或LINGO或Crystal Ball。工具的应用难易结合,让读者循序渐进地学习相关工具。JMP和Mintab用来分析数据,分析的结果使用Excel、LINGO、Crystal Ball来建立数据模型,最后使用Xcelsius来动......一起来看看 《大数据时代小数据分析》 这本书的介绍吧!