Get a Brain



Let’s build a neural network from scratch. Our artificial brain should run on just the core Haskell system.

We follow the free online book Neural Networks and Deep Learning by Michael Nielsen. We will train a network to recognize handwritten digits, specifically those in the MNIST database of handwritten digits.

Give me your digits

Let us first dispense with the unpleasantries. While our neural network shall run on the base install, to read the input data we must venture slightly further because the database files are gzip-compressed. Accordingly, we run:

$ cabal install zlib

We download four input files: train-images-idx3-ubyte.gz, train-labels-idx1-ubyte.gz, t10k-images-idx3-ubyte.gz, and t10k-labels-idx1-ubyte.gz.

The MNIST page describes their format, which the following program can read:

import Codec.Compression.GZip (decompress)
import qualified Data.ByteString.Lazy as BS
import Data.Functor
import System.Random

-- Map a pixel byte (0..255) to one of six ASCII shades.
render n = let s = " .:oO@" in s !! (fromIntegral n * length s `div` 256)

main = do
  -- Image file: a 16-byte header, then 28*28 bytes per image.
  s <- decompress <$> BS.readFile "train-images-idx3-ubyte.gz"
  -- Label file: an 8-byte header, then one byte per label.
  l <- decompress <$> BS.readFile "train-labels-idx1-ubyte.gz"
  -- Pick one of the 60000 training cases at random.
  n <- (`mod` 60000) <$> randomIO
  putStr . unlines $
    [(render . BS.index s . (n*28^2 + 16 + r*28 +)) <$> [0..27] | r <- [0..27]]
  print $ BS.index l (n + 8)

A sample run:

             :@@@@:.
           :O@@@@@@O
       .::@@@@@@@@@@:
     OO@@@@@@@@Oo@@@@
    :OO@@@@@:    O@@@
       OO        @@@o
                O@@@.
               .@@@O
              .@@@@:
              @@@@:
             :@@@:
             O@@.
            :@@O
           .@@@.
           O@@.
          o@@@
          @@@O
          @@O
          @@@.
          :@O

7

Ample Samples

Later on, we will need to sample from a normal distribution. Packages exist for this, but we stay within core Haskell by implementing the Box-Muller transform. Though our version is inefficient, it’s good enough for our purposes.

gauss :: Float -> IO Float
gauss stdev = do
  x1 <- randomIO
  x2 <- randomIO
  return $ stdev * sqrt (-2 * log x1) * cos (2 * pi * x2)
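
For example, a throwaway main of my own prints a few samples at the 0.01 scale we later use for the weights (it assumes the gauss above and import System.Random are in scope):

main = sequence (replicate 5 (gauss 0.01)) >>= print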

Weight Training

A single neuron takes in a bunch of floats from other neurons or directly from the input data, multiplies each by a weight, and sums them. It adds this total to a bias, passes it through the activation function, and outputs the result.

This is easier to state with code. Let ws be the weights, b the bias, as the inputs, and f the activation function. Then the output is:

f (sum (zipWith (*) ws as) + b)

We choose the rectifier (ReLU) to be our activation function. This differs from Nielsen, who uses the sigmoid function, and it means we should initialize our biases differently.

relu = max 0
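
As a made-up concrete instance with relu as f: weights [0.5, -1], bias 0.1, and inputs [0.2, 0.3] give an output of 0, because the weighted sum dips below zero:

relu (sum (zipWith (*) [0.5, -1] [0.2, 0.3]) + 0.1)
-- = relu (0.1 - 0.3 + 0.1) = relu (-0.1) = 0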

Like Nielsen, our network has an input layer of 28² = 784 raw numbers which feed into a hidden layer of 30 neurons, which in turn feeds into an output layer of 10 neurons. Each pair of adjacent layers has full mesh topology, namely, every node of one layer is connected to every node of the next.
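
To get a feel for the scale this implies, here is the parameter count (my own arithmetic, added for orientation):

-- hidden layer: 30 biases + 30*784 weights = 23550 parameters
-- output layer: 10 biases + 10*30  weights =   310 parameters
-- total:                                     23860 parameters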

The input values lie in [0,1] and indicate the darkness of a pixel.

The output neurons give nonnegative values: we’ll train the nth neuron to output at least 1 if the digit is n, and 0 otherwise.

This scheme results in a peculiar cost function. If we had also chosen the sigmoid function, then our outputs would be bounded above by 1 and we could train our network to aim for exactly 1. But we’ve chosen relu, whose outputs can be arbitrarily large. Intuitively, I feel it makes sense to interpret anything that is at least 1 as a strong positive indication, and we should train for 0 otherwise. I don’t know what the literature recommends.

For each layer of neurons, we store the biases in a list of floats [Float]: the ith float is the bias for the ith neuron. Similarly, we store the weights in a two-dimensional matrix of floats [[Float]]: the ith row holds the weights of the inputs to the ith node. Since our neural network is a bunch of layers, we store it as a list of biases and weights for each layer: [([Float], [[Float]])].
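
For instance, a tiny network with 2 inputs, 3 hidden neurons, and 1 output neuron could be written out by hand like this (my illustration; the numbers are arbitrary and only show the shapes):

tiny :: [([Float], [[Float]])]
tiny =
  [ ([1, 1, 1], [[0.01, -0.02], [0.00, 0.03], [-0.01, 0.02]])  -- hidden: 3 neurons, 2 weights each
  , ([1],       [[0.02, -0.01, 0.01]])                         -- output: 1 neuron, 3 weights
  ]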

We initialize all the biases to 1, and draw the weights from a normal distribution with mean 0 and standard deviation 0.01.

-- Needs Control.Monad for zipWithM and replicateM.
newBrain :: [Int] -> IO [([Float], [[Float]])]
newBrain szs@(_:ts) = zip (flip replicate 1 <$> ts) <$>
  zipWithM (\m n -> replicateM n $ replicateM m $ gauss 0.01) szs ts

main = do
  putStrLn "3 inputs, a hidden layer of 4 neurons, and 2 output neurons:"
  newBrain [3, 4, 2] >>= print

We’ll want to hang on to the values before they are fed to the activation functions; these are called the weighted inputs. We compute an entire layer of weighted inputs from the previous layer as follows:

zLayer :: [Float] -> ([Float], [[Float]]) -> [Float]
zLayer as (bs, wvs) = zipWith (+) bs $ sum . zipWith (*) as <$> wvs

I’ve used wvs to suggest “weight vectors”, to remind us this is a list of rows of a matrix.
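
A quick check on made-up numbers, with two neurons taking three inputs each:

zLayer [1, 0, 0.5] ([0.1, 0.2], [[1, 1, 1], [2, 0, -2]])
-- = zipWith (+) [0.1, 0.2] [1.5, 1.0]
-- = [1.6, 1.2]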

We can run the activation function on the weighted inputs to compute the actual outputs with (relu <$>). A suitable fold takes us from the start of the neural network to the end:

feed :: [Float] -> [([Float], [[Float]])] -> [Float]
feed = foldl' (((relu <$>) . ) . zLayer)

-- Generate a neural network, and supply the inputs [0.1, 0.2, 0.3].
main = newBrain [3, 4, 2] >>= print . feed [0.1, 0.2, 0.3]

Lesson Plan

We train the weights and biases using gradient descent, so we need the derivative of the activation function (with its discontinuity filled in):

relu' x | x < 0     = 0
        | otherwise = 1
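
A couple of spot checks (mine); filling in the discontinuity with 1 means a neuron whose weighted input is exactly 0 can still receive a gradient:

-- relu' (-3) == 0
-- relu' 0    == 1
-- relu' 2    == 1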

We use the same desired output vector format as Nielsen, that is, if a given training example is labeled with the digit i, then we set the ith component to 1, and the others to 0.

However, since we consider any value of at least 1 in our output vectors to play the same role as 1, our derivative of the cost with respect to the output value is:

dCost a y | y == 1 && a >= y = 0
          | otherwise        = a - y
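
A few spot checks with made-up values:

-- dCost 1.3 1 == 0      -- right digit, already at least 1: no push
-- dCost 0.4 1 == -0.6   -- right digit but undershooting: push upward
-- dCost 0.7 0 == 0.7    -- wrong digit: push back down toward 0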

Now, backpropagation requires the weighted inputs and activations of every neuron, that is, the values just before and after we run the activation function during feedforward. Because backpropagation goes through the layers in reverse order, it helps to have these values in reverse order, which leads to the following awkwardly named utility function:

-- xv: vector of inputs
-- Returns a list of (weighted inputs, activations) of each layer,
-- from last layer to first.
revaz :: [Float] -> [([Float], [[Float]])] -> ([[Float]], [[Float]])
revaz xv = foldl' (\(avs@(av:_), zs) (bs, wms) -> let
  zs' = zLayer av (bs, wms) in (((relu <$> zs'):avs), (zs':zs))) ([xv], [])
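
For our [784, 30, 10] network the shapes come out as follows (a sketch of mine to make the reversed order concrete):

-- revaz xv net ==
--   ( [ output activations (10), hidden activations (30), xv itself (784) ]
--   , [ output weighted inputs (10), hidden weighted inputs (30) ] )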

From these values we compute the output error delta0, and backpropagate:

-- xv: vector of inputs
-- yv: vector of desired outputs
-- Returns list of (activations, deltas) of each layer in order.
deltas :: [Float] -> [Float] -> [([Float], [[Float]])] -> ([[Float]], [[Float]])
deltas xv yv layers = let
  (avs@(av:_), zv:zvs) = revaz xv layers
  delta0 = zipWith (*) (zipWith dCost av yv) (relu' <$> zv)
  in (reverse avs, f (transpose . snd <$> reverse layers) zvs [delta0]) where
    f _ [] dvs = dvs
    f (wm:wms) (zv:zvs) dvs@(dv:_) = f wms zvs $ (:dvs) $
      zipWith (*) [(sum $ zipWith (*) row dv) | row <- wm] (relu' <$> zv)

We opt for online learning; in other words, we modify our weights and biases immediately after processing a case, rather than batching the changes. Our learning rate is 0.002:

eta = 0.002

descend av dv = zipWith (-) av ((eta *) <$> dv)

learn :: [Float] -> [Float] -> [([Float], [[Float]])] -> [([Float], [[Float]])]
learn xv yv layers = let (avs, dvs) = deltas xv yv layers
  in zip (zipWith descend (fst <$> layers) dvs) $
    zipWith3 (\wvs av dv -> zipWith (\wv d -> descend wv ((d*) <$> av)) wvs dv)
      (snd <$> layers) avs dvs
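
As a sanity check of descend (my numbers), descend [1, 2] [10, 20] gives [0.98, 1.96]. Online training over a list of cases is then just a left fold of learn, which is essentially how the full program below uses it; a minimal sketch with a hypothetical list of (input, target) pairs:

-- Assumes Data.List (foldl') and the definitions above are in scope.
train :: [([Float], [Float])] -> [([Float], [[Float]])] -> [([Float], [[Float]])]
train examples net = foldl' (\b (xv, yv) -> learn xv yv b) net examples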

When trained on the first 10000 cases, our neural network correctly classifies about 80% of the test cases.

The Numbers Game

The following program assumes the four input MNIST database files reside in the current directory. We randomly select a test case as an example and print it. We then show how training pays off with crude graphs that represent the neural network’s response to our example. Finally, we try all 10000 test cases and count the number we get right.

The neural network takes about 40 lines and compiles on a base Haskell system. The code to read the data, train the network, and print results at certain checkpoints takes about another 40 lines.

import Codec.Compression.GZip (decompress)
import qualified Data.ByteString.Lazy as BS

import Control.Monad
import Data.Functor
import Data.Ord
import Data.List
import System.Random

gauss scale = do
  x1 <- randomIO
  x2 <- randomIO
  return $ scale * sqrt (-2 * log x1) * cos (2 * pi * x2)

newBrain :: [Int] -> IO [([Float], [[Float]])]
newBrain szs@(_:ts) = zip (flip replicate 1 <$> ts) <$>
  zipWithM (\m n -> replicateM n $ replicateM m $ gauss 0.01) szs ts

relu = max 0
relu' x | x < 0      = 0
        | otherwise  = 1

zLayer as (bs, wvs) = zipWith (+) bs $ sum . zipWith (*) as <$> wvs

feed = foldl' (((relu <$>) . ) . zLayer)

revaz xs = foldl' (\(avs@(av:_), zs) (bs, wms) -> let
  zs' = zLayer av (bs, wms) in ((relu <$> zs'):avs, zs':zs)) ([xs], [])

dCost a y | y == 1 && a >= y = 0
          | otherwise        = a - y

deltas xv yv layers = let
  (avs@(av:_), zv:zvs) = revaz xv layers
  delta0 = zipWith (*) (zipWith dCost av yv) (relu' <$> zv)
  in (reverse avs, f (transpose . snd <$> reverse layers) zvs [delta0]) where
    f _ [] dvs = dvs
    f (wm:wms) (zv:zvs) dvs@(dv:_) = f wms zvs $ (:dvs) $
      zipWith (*) [sum $ zipWith (*) row dv | row <- wm] (relu' <$> zv)

eta = 0.002

descend av dv = zipWith (-) av ((eta *) <$> dv)

learn xv yv layers = let (avs, dvs) = deltas xv yv layers
  in zip (zipWith descend (fst <$> layers) dvs) $
    zipWith3 (\wvs av dv -> zipWith (\wv d -> descend wv ((d*) <$> av)) wvs dv)
      (snd <$> layers) avs dvs

-- nth image as raw bytes: skip the 16-byte header; each image is 28*28 bytes.
getImage s n = fromIntegral . BS.index s . (n*28^2 + 16 +) <$> [0..28^2 - 1]
-- Scale pixel values to lie in [0,1].
getX     s n = (/ 256) <$> getImage s n
-- nth label: skip the 8-byte header; one byte per label.
getLabel s n = fromIntegral $ BS.index s (n + 8)
-- Desired output vector: 1 at the labeled digit, 0 elsewhere.
getY     s n = fromIntegral . fromEnum . (getLabel s n ==) <$> [0..9]

render n = let s = " .:oO@" in s !! (fromIntegral n * length s `div` 256)

main = do
  [trainI, trainL, testI, testL] <- mapM ((decompress  <$>) . BS.readFile)
    [ "train-images-idx3-ubyte.gz"
    , "train-labels-idx1-ubyte.gz"
    ,  "t10k-images-idx3-ubyte.gz"
    ,  "t10k-labels-idx1-ubyte.gz"
    ]
  b <- newBrain [784, 30, 10]
  n <- (`mod` 10000) <$> randomIO
  putStr . unlines $
    take 28 $ take 28 <$> iterate (drop 28) (render <$> getImage testI n)

  let
    example = getX testI n
    bs = scanl (foldl' (\b n -> learn (getX trainI n) (getY trainL n) b)) b [
     [   0.. 999],
     [1000..2999],
     [3000..5999],
     [6000..9999]]
    smart = last bs
    cute d score = show d ++ ": " ++ replicate (round $ 70 * min 1 score) '+'
    bestOf = fst . maximumBy (comparing snd) . zip [0..]

  forM_ bs $ putStrLn . unlines . zipWith cute [0..9] . feed example

  putStrLn $ "best guess: " ++ show (bestOf $ feed example smart)

  let guesses = bestOf . (\n -> feed (getX testI n) smart) <$> [0..9999]
  let answers = getLabel testL <$> [0..9999]
  putStrLn $ show (sum $ fromEnum <$> zipWith (==) guesses answers) ++
    " / 10000"

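One way to build and run it, assuming the code is saved as brain.hs (the file name and flags are my choice; any reasonably recent GHC with the zlib package installed should do):

$ ghc -O2 brain.hs
$ ./brain
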
Let’s walk through a sample run. The program first randomly picks a reasonably legible “9”. Without any training, the 9th output is large, but sadly, so is every other output.

            .o@@@@@:
           o@@@@@@@@
          o@@@@@@@@@:
         .@@@O. .@@@.
         :@@:  :@@@@
         :@@@@@@@@@@
          O@@@@@@@@:
           :@@@@@@O
               O@@o
               @@@
              :@@O
              O@@:
             .@@@
             O@@o
            :@@@.
           .@@@o
           o@@O
           @@@:
          .@@@
           o@:

0: ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1: ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2: ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3: ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4: +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
5: +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
6: ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
7: ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
8: ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
9: +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

After 1000 iterations, all outputs are lower. Trouble is apparently brewing: the 9th output is tiny, and there are 6 relatively large outputs.

0: ++++++
1: +++++
2:
3: +++++++++
4: +++++++++
5: ++
6: +++++++++++
7: ++++++++++++
8: +
9: +

After 2000 more iterations, we see marked improvement: now the most likely contenders are 7 and 9, though it likes 7 more than 9.

0:
1: +++++
2:
3: +++
4: +++++++++++
5: +++++++++++
6:
7: ++++++++++++++++++++++++++++++++++
8: +++++
9: +++++++++++++++++++

By the 10000th iteration, the 9th output has caught up to the 7th. In fact, the 9th output is larger, but the graph is too coarse to show this. We wind up classifying our example correctly.

Out of the 10000 test cases, we classify 8467 cases correctly.

0:
1:
2:
3: +++++++
4:
5:
6:
7: +++++++++++++++++++++++++++++++++
8:
9: +++++++++++++++++++++++++++++++++

best guess: 9
8467 / 10000

Beginner’s Luck

This project was guided by expedience. I wanted to code a neural network from scratch, but I also wanted it running as soon as possible, so I cut corners.

I chose online learning so I could avoid code for batching deltas. My code also trains on the data in the order of appearance.

I chose ReLU because I heard it can converge faster than the sigmoid. This meant I was unable to mimic Nielsen’s code for picking weights and biases, because a ReLU neuron that is initially likely to output zero will be hard to train. Luckily, a brief web search revealed that we can set all the initial biases to 1. As for the weights, a normal distribution around 0 with standard deviation 0.01 felt about right.

I stored matrices in Haskell lists: great for economy of expression, but not so great for performance. To compensate, the code only trains for one session, and on only one sixth of the data. To further alleviate my impatience, my program gives updates on its progress.

Despite so many compromises, the above reliably produces neural networks that correctly classify roughly 75% to 85% of the test cases, taking about a minute to run on my laptop.

Haste Makes Sloth

I obtained better results by randomizing the order of the training cases and feeding them all to the network, though sometimes the results were worse. One run produced a neural net scoring 9202 out of 10000.

I made a webpage to show this off, complete with a crude editor.

I had hoped to code in Haskell exclusively, but the Haste compiler introduced too much overhead. I resorted to moving both the neural net feedforward routine and the sample image data directly into JavaScript. I wonder if Fay can help.

The source is available in the following Git repository.

