内容简介:Here is just a quick look at how those bidirectional parsers look like in code:You can see a bigger example in the repository README:
tomland
is a bidirectional TOML parsing and pretty-printing library. “Bidirectional” here means that the library allows the user to specify in one place how to convert Haskell data types to TOML format and how to parse TOML configurations to custom user types. This approach allows eliminating one layer of bugs, specifically, when you update your data type and only one part of the conversion (for example, parser) but accidentally forget to update the other one (printer). With the bidirectional approach implemented in tomland
there’s no chance of you overlooking anything.
Here is just a quick look at how those bidirectional parsers look like in code:
data User = User { userName :: Text , userAge :: Int } userCodec :: TomlCodec User userCodec = User <$> Toml.text "name" .= userName <*> Toml.int "age" .= userAge
You can see a bigger example in the repository README:
The idea is straightforward but the implementation requires some tricks. As a result of this, in order to provide convenient and type-safe interfaces tomland
uses the following non-trivial Haskell techniques:
- GADTs and existential wrappers for core types and theorem-proving for more compile-time guarantees.
- Category theory and Category instance for the custom data type in order to implement tagged partial bidirectional isomorphism .
- Monadic profunctors approach to compose those isomorphisms.
- Custom implementation of the prefix tree specific to the structure of nested keys in TOML format.
- Property-based tests with the hedgehog library that covers typeclass laws and roundtrip properties.
- Type-safe EDSL for specifying TOML AST.
Despite the fact that a lot of fancy features are used, the library is still well-tested. The whole code of the library itself comprises 1600 LOC and 1100 LOC of tests. But without the powerful Haskell type system, the size of such tests could be much bigger.
This blog post describes the architecture and details of the implementation of tomland
.
Why bidirectional TOML?
TOML is great for static configurations. Its specification is very simple but expressive enough to cover most cases. Also, it’s unambiguous and human-readable which makes it much smoother to read and edit the configuration.
The state of TOML parsing libraries in the Haskell ecosystem can still be improved. There are already several libraries:
But not all of them have nice ways to convert a TOML AST to custom Haskell types. For example, htoml
implements this conversion via ToJSON/FromJSON
typeclasses from the aeson
library and you might not always want to have extra dependencies if you don’t need JSON. Also, none of the libraries have pretty-printing abilities. Usually, you don’t need pretty-printing for TOML but there were at least two use-cases in my experience where I needed pretty-printing:
-
summoner
: Summoner is the tool for scaffolding modern Haskell projects. It can read TOML configuration where you can specify some settings in advance. But for the moment, you need to write this TOML config file manually. Though there are plans to implement a feature that allows you to build your configuration settings interactively via
summoner
itself and for this we need an ability to convert theSettings
data type to TOML configuration. -
life-sync
: this tool allows you to sync your configs across multiple machines. It stores all tracked files in its own TOML configuration file. In order to keep this file up-to-date,
life-sync
needs to parse from this configuration file and it needs to convert its Haskell types to TOML.
I’m not going to compare TOML with JSON/YAML/Dhall, as that is not in the scope of this blog post. Instead, let’s dive into the implementation details of the library.
Core architecture of Tomland
Below I’m going to mention the key architecture decisions behind tomland
:
-
A function-based approach for the description of how to convert Haskell data types to/from TOML instead of a typeclasses-based approach. This approach has a lot of pleasant advantages for decoding libraries: no orphan instances and no ambiguity when there are multiple options for conversion. Here
tomland
goes along with libraries like hedgehog , sv and waargonaut . - Two-phase approach: parsing (from text to intermediate AST) and decoding (from AST to custom user types). And similarly for pretty-printing.
-
Make invalid states unrepresentable: TOML supports arrays of values. But the TOML specification also forbids us from having heterogeneous arrays. This is captured at compile time by
tomland
. -
Parser combinators for parsing:
tomland
uses the modern parsing library megaparsec .
It was mentioned earlier that tomland
uses GADTs
to represent AST. Let’s take a closer look at this. When you want to convert unstructured runtime data to a GADT in Haskell you have to use a 2-step approach:
- Parse this data into a plain old ADT.
- Type check a simple ADT to a GADT.
TOML has several primitive types, but let’s just consider Integer
, Text
and Array
for simplicity. Here is how an ADT for untyped representation of TOML values looks like:
data UValue = UInteger Integer | UText Text | UArray [UValue]
And here is a GADT version of this data type, that guarantees that all the elements of the array have the same type (it uses extensions -XDataKinds
and -XGADTs
):
data TValue = TInteger | TText | TArray data Value (t :: TValue) where Integer :: Integer -> Value 'TInteger Text :: Text -> Value 'TText Array :: [Value t] -> Value 'TArray
Let’s assume that we have already implemented a parser for UValue
(instead of assuming you can also look at the real code
). Now we need to convert UValue
to Value
. But it’s not that simple: (1) this operation can fail and (2) at compile-time we don’t know the type of the resulted Value
.
To overcome the second challenge, we need to introduce an existential wrapper around Value
.
data AnyValue = forall (t :: TValue) . AnyValue (Value t)
NOTE: you can’t return Value t
from the function but you can return AnyValue
since the type variable t
is hidden inside the AnyValue
constructor.
Our conversion function can fail only because of a single reason: when elements of UArray
field have different constructors. The algorithm to check that all elements of the list have the same type is the following:
- Pattern match on the list. If the list is empty then this is a valid array.
- If the list is not empty then we assume that the type of the first element is the expected type of the whole array. So we just need to check every element of the list against this first element and ensure that all of them have the same type. But this needs to be done carefully in order to convince GHC that we checked everything properly.
First, let’s implement the sameValue
function that checks whether two Value
s have the same type. Here we’re going to use propositional equality
which is implemented in the following way:
data a :~: b where Refl :: a :~: a
And here is sameValue
:
data TypeMismatchError = TypeMismatchError { typeExpected :: TValue , typeActual :: TValue } valueType :: Value t -> TValue valueType (Integer _) = TInteger valueType (Text _) = TText valueType (Array _) = TArray sameValue :: Value a -> Value b -> Either TypeMismatchError (a :~: b) sameValue Integer{} Integer{} = Right Refl sameValue Text{} Text{} = Right Refl sameValue Array{} Array{} = Right Refl sameValue l r = Left $ TypeMismatchError { typeExpected = valueType l , typeActual = valueType r }
Now we’re ready to implement a proper simple type checking algorithm!
typeCheck :: UValue -> Either TypeMismatchError AnyValue typeCheck (UInteger n) = Right $ AnyValue $ Integer n typeCheck (UText s) = Right $ AnyValue $ Text s typeCheck (UArray a) = case a of [] -> Right $ AnyValue $ Array [] -- empty list is valid array x:xs -> do AnyValue v <- typeCheck x -- type check first element of array AnyValue . Array <$> checkElem v xs -- check `xs` against `v` where checkElem :: Value t -> [UValue] -> Either TypeMismatchError [Value t] checkElem v [] = Right [v] -- no values = no problems checkElem v (x:xs) = do AnyValue vx <- typeCheck x Refl <- sameValue v vx (v :) <$> checkElem vx xs
In order to convince GHC that the elements of the array have the same type, we pattern match on Refl
. After this pattern matching GHC can see the type equality between the given element and the head of the list. That’s why we can put them in the same list.
We managed to deal only with values. But in TOML, values are assigned to textual keys. Here is an example of this:
server.id = 42 server.host = "best-haskell-blog.com" server.ports = [8080, 8081, 9080, 9081]
Here server.ports
is a key and its corresponding value is an array of integers. Keys are represented in the following way in tomland
:
newtype Piece = Piece { unPiece :: Text } newtype Key = Key { unKey :: NonEmpty Piece }
Keys in TOML are dot-separated words. If some keys share a common prefix, they semantically belong to the same object (though you don’t have to follow this rule, it helps readability a lot). TOML also has tables that help to visually separate parts of the static configuration. Here is an example of tables and nested tables:
[server] [server.production] id = 42 host = "best-haskell-blog.com" ports = [8080, 8081] [server.staging] id = 777 host = "okayish-haskell-blog.com" ports = [9080, 9081]
To represent this structure nicely we need some prefix tree. There are several libraries on Hackage that implement very decent prefix trees but unfortunately, none of them fit our use-case. So we decided to implement our own data structure. It has the following shape (using two mutually recursive data structures):
type Prefix = Key type PrefixMap a = HashMap Piece (PrefixTree a) data PrefixTree a = Leaf Key a | Branch Prefix (Maybe a) (PrefixMap a)
When we want to insert a new key into PrefixTree
, we need to split the given key and the key inside the current tree node into common
and remaining
parts and then split the current node properly. The result of the routine that finds key prefixes is captured by the following data type:
data KeysDiff = Equal -- ^ Keys are equal. | NoPrefix -- ^ Keys don't have any common part. | FstIsPref -- ^ The first key is the prefix of the second one. !Key -- ^ Rest of the second key. | SndIsPref -- ^ The second key is the prefix of the first one. !Key -- ^ Rest of the first key. | Diff -- ^ Keys have a common prefix. !Key -- ^ Common prefix. !Key -- ^ Rest of the first key. !Key -- ^ Rest of the second key. -- | Takes two keys and returns their difference. keysDiff :: Key -> Key -> KeysDiff
Every TOML configuration stores the following items:
- Top-level keys.
- Zero or more tables.
- Zero or more array of tables.
The structure of TOML is recursive. So every table and array of tables has the same structure. That’s why a TOML AST is represented in the following way:
data TOML = TOML { tomlPairs :: HashMap Key AnyValue , tomlTables :: PrefixMap TOML , tomlTableArrays :: HashMap Key (NonEmpty TOML) }
That’s all regarding TOML AST! The following sections of this tutorial cover the bidirectional conversion aspect of the library.
Tagged partial bidirectional isomorphism
First, we need to learn how to switch between basic Haskell types and AnyValue
in both directions. Converting from Text
to AnyValue
is pretty simple. Just apply the Text
constructor of Value
and then wrap into AnyValue
. However, conversion from AnyValue
to Text
may fail because there might be a different constructor. It turns out that conversion to AnyValue
may also fail. Consider ByteString
: in order to convert ByteString
to AnyValue
you first need to convert it to Text
and this operation may fail. Alternatively, you can convert ByteString
to the array of integers.
You can see now that we need a way to convert types in both directions. Just like isomorphisms, with the only difference that the conversion in both directions may fail. Thus it’s partial. And we also want to report good error messages to our users. So this is actually tagged partial bidirectional isomorphism
. In tomland
this concept is captured by the following data type with a much shorter name:
data BiMap e a b = BiMap { forward :: a -> Either e b , backward :: b -> Either e a }
If we look at types as sets of values then BiMap
can be represented by the following graph:
Type parameters a
and b
represent the values between which we’re converting. And e
is the type of error.
NOTE:In tomland
BiMap
is specialized to the following error type:
data TomlBiMapError = WrongConstructor Text Text -- expected vs. actual | WrongValue MatchError | ArbitraryError Text type TomlBiMap = BiMap TomlBiMapError
When we want to write a bidirectional mapping between Text
and AnyValue
we need to provide two functions. If we want to do the same for ByteString
and AnyValue
we can implement two similar functions But you may notice that the logic for the Text
converter is a subset of the logic for the ByteString
one.
It turns out that BiMap
composes very nicely! And we need to write only BiMap
between ByteString
and Text
and compose it with BiMap
between Text
and AnyValue
. This is possible thanks to the following
Category
instance for BiMap
because BiMap
forms category. Category
allows us to compose objects of the type Type -> Type -> Type
, but for that, you need to implement the composition operator and the identity object which is a neutral element for composition.
instance Category (BiMap e) where id :: BiMap e a a id = BiMap Right Right (.) :: BiMap e b c -> BiMap e a b -> BiMap e a c bc . ab = BiMap { forward = forward ab >=> forward bc , backward = backward bc >=> backward ab }
This instance can be described clearly by the following picture:
It’s not convenient to use the dot-operator from Control.Category
to compose BiMap
s because this operator is already specialized to arrow (->)
in Prelude
. But we still can use the ‘>>>’ and ‘<<<’ operators for composing objects and this is quite nice!
_ByteStringText :: TomlBiMap ByteString Text _Text :: TomlBiMap Text AnyValue _ByteString :: TomlBiMap ByteString AnyValue _ByteString = _ByteStringText >>> _Text
tomland
implements a lot of BiMap
combinators out-of-the-box for you so you should be able to avoid writing your own mini-tomland inside your application if you want to use it on common data structures.
As of now, we are converting between primitive types and TOML values. But usually, in Haskell, we work with a lot of custom data types, including both, product and sum types. This section covers how to compose different BiMap
s into a single bidirectional converter for a Haskell data type.
The idea for bidirectional converters is not new. It’s based on the approach called monadic profunctors and is described in detail in this blog post:
And there is the codec
library which implements this approach for aeson
and binary
:
tomland
uses the same idea but different connecting pieces and expands this approach to support sum types, not only record types.
Monadic profunctors
Here is the core Codec
type for bidirectional conversion:
data Codec r w c a = Codec { codecRead :: r a , codecWrite :: c -> w a }
It looks scary but I will try to explain its meaning.
-
r
is usually someReader
monad. It stores an intermediate AST representation inside the context which we’re going to query to fetch pieces of our big custom user type. -
w
is someWriter
orState
which stores the result of converting a current piece of custom user type to an intermediate AST. -
a
is the type that we’re converting. For example,Integer
,Text
orPerson
. -
c
is a fake type variable which is required in order to make this magic work. It should be equal toa
but if we make ita
directly in the type definition, we won’t be able to write the required instances.
Codec
has Functor
, Applicative
, Monad
, Alternative
and Profunctor
instances.
NOTE:The Profunctor
instance is not implemented in tomland
because, currently, it requires the addition of the profunctors
package to our dependencies; this would make tomland
heavier, but it’s not that necessary to have this instance, it’s enough to just have functions from this instance, so we can wait until the profunctors
package is migrated to base
. At the time of writing, there is an open GHC ticket for this:
The type of codecWrite
semantically should be equivalent to a -> w ()
: we take a value of our type a
, change the context, and ignore the result. But if we do so, we’re not able to implement a desirable, convenient interface.
Another way to look at Codec
: it is just a special case of Product
monad. Here is a quick reminder of what the Product
data type is:
data Product f g a = Pair (f a) (g a) instance (Monad f, Monad g) => Monad (Product f g) where Pair m n >>= f = Pair (m >>= fstP . f) (n >>= sndP . f) where fstP (Pair a _) = a sndP (Pair _ b) = b
So Codec
is semantically equivalent to the following Product
specialization:
type Codec r w c a = Product r (ReaderT c w) a
Product
stores two monadic actions and performs operations over them in parallel. So at one step of >>=
it changes both monads. But the changes are parallel and independent. This means that you can describe two monadic actions in a single place, but later you can choose to work with only one of them.
It’s an interesting (and not that difficult) exercise to implement the Functor
, Applicative
, and Monad
instances for Codec
. So let’s look closer at dimap
from profunctor part:
dimap :: (Functor r, Functor w) => (c -> d) -- ^ Mapper for consumer -> (a -> b) -- ^ Mapper for producer -> Codec r w d a -- ^ Source 'Codec' object -> Codec r w c b -- ^ Target ‘Codec’ object dimap f g codec = Codec { codecRead = g <$> codecRead codec , codecWrite = fmap g . codecWrite codec . f }
In real life we’re going to substitute a
in place of c
:
type BiCodec r w a = Codec r w a a
So, a simpler version of dimap
for BiCodec
looks like this:
dimap :: (Functor r, Functor w) => (b -> a) -> (a -> b) -> BiCodec r w a -> BiCodec r w b dimap f g codec = Codec { codecRead = g <$> codecRead codec , codecWrite = fmap g . codecWrite codec . f }
Now the idea behind Profunctor
becomes much clearer! If we have a bidirectional converter for a value of type a
and we have two pure functions to convert from a
to b
and vice versa we can have a bidirectional converter for b
.
There’s only a single important piece left: a convenient operator to compose different Codec
s:
infixl 5 .= (.=) :: Codec r w field a -> (object -> field) -> Codec r w object a codec .= getter = codec { codecWrite = codecWrite codec . getter }
An attentive reader may notice that the (.=)
operator is lmap
from Profunctor
typeclass.
We covered only general data types but it’s beneficial for our understanding to see how it is specialized to write bidirectional codecs for TOML:
type Env = ExceptT DecodeException (Reader TOML) type St = MaybeT (State TOML) type TomlCodec = BiCodec Env St
Our r
variable stores a a TOML AST inside the context and can return an error. So r a
in Codec
specializes to:
codecRead @tomland :: TOML -> Either DecodeException a
And St
stores produced TOML in context. All monadic actions will insert keys inside this AST and later it can be converted to just Text
. So it’s equivalent to the following type:
codecWrite @tomland :: a -> TOML -> (Maybe a, TOML)
NOTE: TOML
data type from tomland
implements Semigroup
and Monoid
instances, so we could use Writer
monad instead of State
. But it’s
more efficient to use State
.
Now we need a function that can lift TomlBiMap
to TomlCodec
. But this function requires a key. It will lookup TOML
AST to find AnyValue
under given Key
and then it will try to convert AnyValue
according to the given BiMap
. This function is simply called match
in tomland
:
match :: TomlBiMap a AnyValue -> Key -> TomlCodec a
Now we can write helper functions to make the interface easier to work with:
integer :: Key -> TomlCodec Integer integer = match _Integer text :: Key -> TomlCodec Text text = match _Text arrayOf :: TomlBiMap a AnyValue -> Key -> TomlCodec [a] arrayOf = match . _Array
Note how we managed to decompose the problem of finding a value under a key and converting this value to Haskell type into two separate problems. BiMap
takes care of all errors during conversion between types and doesn’t know anything about TOML and keys. And match
doesn’t know what value we’re working with (because of parametric polymorphism), all it does is, it just looks up the required key inside the map and handles the two resulting cases: when the key is found and when it is not. For pretty-printing, match
inserts the value under the given key.
Tables in TOML can be used to represent codecs for nested data structures. Remember the example with the production
and staging
TOML servers? Here is how the Codec
for this data type looks like:
data Settings = Settings { settingsId :: Int , settingsHost :: Text , settingsPorts :: [Int] } settingsCodec :: TomlCodec Settings settingsCodec = Settings <$> Toml.int "id" .= settingsId <*> Toml.text "host" .= settingsHost <*> Toml.arrayOf Toml._Int "ports" .= settingsPorts data ServerConfig = ServerConfig { serverConfigProduction :: Settings , serverConfigStaging :: Settings } serverConfigCodec :: TomlCodec ServerConfig serverConfigCodec = flip Toml.table "server" $ ServerConfig <$> Toml.table settingsCodec "production" .= settingsProduction <*> Toml.table settingsCodec "staging" .= settingsStaging
You may notice that most of the time we’re using Applicative
instance of Codec
, despite the fact that this approach is called monadic profunctors
. But, it turns out that applicative functors are more convenient here. Though, of course, you can have a lot of fun with the Monad
instance as well. For example, configuration can contain a field with a key name like spec-version
and the value of this field determines how this particular configuration should be parsed.
And here are a couple of examples with possible error messages:
ghci> showRes = either print print ghci> showRes $ decode (arrayOf _Int "a") "a = [2, 'foo']" tomland decode error: Parse error during conversion from TOML to custom user type: 1:9: | 1 | a = [2, 'foo'] | ^ unexpected ''' expecting ',', ']', or integer ghci> showRes $ decode (arrayOf _Int "b") "a = [2, 3]" tomland decode error: Key b is not found ghci> showRes $ decode (arrayOf _Int "a") "a = [2, 3]" [2,3] ghci> showRes $ decode (arrayOf _Int "a") "a = ['foo', 'bar']" tomland decode error: Invalid constructor * Expected: TInteger * Actual: Text "foo"
As an additional flavour, tomland
has the implementation of EDSL for specifying TOML objects. The implementation is very short.
type TDSL = State TOML () mkToml :: TDSL -> TOML mkToml env = execState env mempty (=:) :: Key -> Value a -> TDSL (=:) k v = modify $ insertKeyVal k v table :: Key -> TDSL -> TDSL table k = modify . insertTable k . mkToml
However, we need a couple of extra instances to make the API more convenient to work with. Due to the fact that our Value
is a GADT, we can implement type-safe instances for sum types!
instance (t ~ 'TInteger) => Num (Value t) where (Integer a) + (Integer b) = Integer $ a + b (Integer a) * (Integer b) = Integer $ a * b abs (Integer a) = Integer (abs a) signum (Integer a) = Integer (signum a) fromInteger = Integer negate (Integer a) = Integer (negate a) instance (t ~ 'TText) => IsString (Value t) where fromString = Text . fromString @Text
NOTE:Here we’re using type equality constraint trick in order to provide better type inference.
With these tools we’re able to specify TOML files in a safer way:
serverToml :: TOML serverToml = table "server" $ do table "production" $ do "id" =: 42 "host" =: "best-haskell-blog.com" "ports" =: Array [8080, 8081] table "staging" $ do "id" =: 777 "host" =: "okayish-haskell-blog.com" "ports" =: Array [9080, 9081]
This is very useful for testing when you want to test conversion but you don’t want to deal with parsing. With EDSL you can create a TOML AST which is guaranteed to be correct and safe by the construction.
You may notice that tomland
was written with the help of a lot of heavy tools. But types don’t substitute tests, so we have A LOT
of them. I will shortly describe what we test in tomland
to assure you that we care very much about the correctness of this library. tomland
is already used in production and in commercial projects so it’s necessary to have reliable tests.
- Unit tests for parsing. The TOML specification is relatively small, but still, there’re a lot of things to keep an eye on. We have ~600 LOC of unit tests for parsing only. And we test not only that we parse correct TOML (files) successfully, but also that we parse incorrect TOML files unsuccessfully (with good error-reporting).
-
Unit tests for
PrefixTree
onlookup
andinsert
. -
Property tests for
PrefixTree
oninsert
withlookup
. -
Property tests for
PrefixTree
forSemigroup
laws. -
Property tests for TOML on
Semigroup
andMonoid
laws. - Roundtrip property tests for parsing and printing TOML AST.
- Roundtrip property tests for decoding and encoding TOML.
-
Roundtrip property tests for every
BiMap
. - Golden unit tests for pretty-printing.
And we also have benchmarks to compare tomland
with other libraries! Not all libraries are being actively maintained, so there were some difficulties in building them on newer GHC versions, therefore we have only compared tomland
with htoml
, htoml-megaparsec
, and toml-parser
. Here is the performance table:
Library | parse :: Text -> AST | transform :: AST -> Haskell |
---|---|---|
tomland | 305.5 μs | 1.280 μs |
htoml | 852.8 μs | 33.37 μs |
htoml-megaparsec | 295.0 μs | 33.62 μs |
toml-parser | 164.6 μs | 1.101 μs |
You may see that tomland
is not the fastest one (though still very fast), but performance hasn’t been optimized so far and:
-
toml-parser
doesn’t support arrays of tables and because of that it’s hardly possible to specify a list of custom data types in TOML with this library. -
tomland
supports the latest TOML spec whilehtoml
andhtoml-megaparsec
don’t have support for all types, values and formats. -
tomland
is the only library that has pretty-printing. -
toml-parser
doesn’t have ways to convert a TOML AST to custom Haskell types andhtoml*
libraries use a typeclasses-based approach viaaeson
library. -
tomland
is bidirectional :)
I hope that this blog post helps you to better understand some advanced topics and Haskell in general. You can see how very different pieces of Haskell can be combined inside a single library and can help to come up with a better architecture. Probably you can borrow some ideas from tomland
for your next library. Or even give tomland
a try for static configuration in your application!
I want to say special thanks to vrom911
, willbasky
, jiegillet
and ghallak
for their massive help with the tomland
library! Their contribution is invaluable.
以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持 码农网
猜你喜欢:本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们。
深入理解程序设计
[美] Jonathan Bartlett / 郭晴霞 / 人民邮电出版社 / 2014-1 / 49.00
是否真正理解汇编语言,常常是普通程序员和优秀程序员的分水岭。《深入理解程序设计:使用Linux汇编语言》介绍了Linux平台下的汇编语言编程,教你从计算机的角度看问题,从而了解汇编语言及计算机的工作方式,为成就自己的优秀程序员之梦夯实基础。 很多人都认为汇编语言晦涩难懂,但New Medio技术总监Jonathan Bartlett的这本书将改变人们的看法。本书首先介绍计算机的体系结构,然后......一起来看看 《深入理解程序设计》 这本书的介绍吧!