Caffe Source Code Reading (1): Data Loading

Category: Databases · Published: 7 years ago


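The first step in training a model is loading the data into memory. Using Caffe's official MNIST example, this article walks through how data gets from files on disk into memory. It assumes the reader already has some familiarity with protobuf, glog, and gflags; if not, look up their basic usage first.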

Overview

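Caffe does not load data straight from the raw files into memory. Instead, it reads from a persistent database store (such as LevelDB or LMDB) and loads from there. The data therefore flows as: raw files (binary data) -> LevelDB (LMDB) -> memory.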

Raw files -> LevelDB (convert_mnist_data.cpp)

The proto file shows the format the data has to be converted into:

channels: the number of channels in an image, e.g. 3 for an RGB image
height, width: the height and width of the image
data: the sample data, i.e. the features

```
message Datum {
  optional int32 channels = 1;
  optional int32 height = 2;
  optional int32 width = 3;
  // the actual image data, in bytes
  optional bytes data = 4;
  optional int32 label = 5;
  // Optionally, the datum could also hold float data.
  repeated float float_data = 6;
  // If true data contains an encoded image that need to be decoded
  optional bool encoded = 7 [default = false];
}
```
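
To make the field layout concrete, here is a minimal sketch that fills a plain C++ struct mirroring the `Datum` fields for one single-channel image. `DatumSketch` and `make_gray_datum` are hypothetical names used only for illustration; the real converter uses the class that protoc generates from the proto file, not this struct.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the protobuf-generated Datum class.
struct DatumSketch {
  int32_t channels = 0;
  int32_t height = 0;
  int32_t width = 0;
  std::string data;  // raw pixel bytes: channels * height * width of them
  int32_t label = 0;
};

// Packs one grayscale image and its label into the sketch struct.
DatumSketch make_gray_datum(const std::vector<uint8_t>& pixels,
                            int32_t rows, int32_t cols, int32_t label) {
  assert(rows > 0 && cols > 0);
  assert(pixels.size() ==
         static_cast<std::size_t>(rows) * static_cast<std::size_t>(cols));
  DatumSketch d;
  d.channels = 1;  // grayscale: a single channel
  d.height = rows;
  d.width = cols;
  d.data.assign(pixels.begin(), pixels.end());
  d.label = label;
  return d;
}
```

For a 28x28 MNIST digit, `data` ends up holding 784 bytes, one per pixel.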

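The files fetched by the download script are binary. The first step is to verify that each file really is in the agreed format by checking its magic number. Anyone who has read or written binary files will remember why: if you skip the check and start reading, the code still runs, but what it produces is something else entirely. This step is necessary to guarantee that the data is correct.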
```
  image_file.read(reinterpret_cast<char*>(&magic), 4);
  magic = swap_endian(magic);
  CHECK_EQ(magic, 2051) << "Incorrect image file magic.";
  label_file.read(reinterpret_cast<char*>(&magic), 4);
  magic = swap_endian(magic);
  CHECK_EQ(magic, 2049) << "Incorrect label file magic.";
```
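
The excerpt calls `swap_endian` without showing its body. The MNIST files store their integers most significant byte first, so on a little-endian machine the raw 4 bytes have to be reversed before they can be compared with 2051 or 2049. The sketch below shows one possible byte swap plus an alternative that does not depend on host byte order; both bodies are assumptions for illustration, and `read_be32` is a hypothetical helper that does not appear in the excerpt.

```cpp
#include <cassert>
#include <cstdint>

// One possible 32-bit byte swap (assumed; the excerpt does not show the body).
uint32_t swap_endian(uint32_t val) {
  // Swap the bytes inside each 16-bit half, then swap the two halves.
  val = ((val << 8) & 0xFF00FF00u) | ((val >> 8) & 0x00FF00FFu);
  return (val << 16) | (val >> 16);
}

// Alternative that works on any host: assemble the value from 4 big-endian bytes.
uint32_t read_be32(const unsigned char* bytes) {
  assert(bytes != nullptr);
  return (static_cast<uint32_t>(bytes[0]) << 24) |
         (static_cast<uint32_t>(bytes[1]) << 16) |
         (static_cast<uint32_t>(bytes[2]) << 8) |
         static_cast<uint32_t>(bytes[3]);
}
```

The image file starts with the bytes `00 00 08 03`. Read directly into a `uint32_t` on a little-endian host, they appear as `0x03080000`, and swapping yields `0x00000803`, which is 2051.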

Next, read the number of items, the image dimensions, and the other parameters:

```
  image_file.read(reinterpret_cast<char*>(&num_items), 4);
  num_items = swap_endian(num_items);
  label_file.read(reinterpret_cast<char*>(&num_labels), 4);
  num_labels = swap_endian(num_labels);
  CHECK_EQ(num_items, num_labels);
  image_file.read(reinterpret_cast<char*>(&rows), 4);
  rows = swap_endian(rows);
  image_file.read(reinterpret_cast<char*>(&cols), 4);
  cols = swap_endian(cols);
```
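
Taken together, the reads above consume a 16-byte image-file header: magic number, item count, rows, columns. The following sketch parses that header from any `std::istream`, so it can be tried on an in-memory buffer without downloading the dataset. `ImageHeader`, `read_u32_be`, and `read_image_header` are hypothetical names, not part of Caffe.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical container for the four header fields of the image file.
struct ImageHeader {
  uint32_t magic = 0;
  uint32_t num_items = 0;
  uint32_t rows = 0;
  uint32_t cols = 0;
};

// Reads one big-endian 32-bit integer, independent of host byte order.
uint32_t read_u32_be(std::istream& in) {
  unsigned char b[4] = {0, 0, 0, 0};
  in.read(reinterpret_cast<char*>(b), 4);
  assert(in.gcount() == 4);  // a short read means the file is truncated
  return (static_cast<uint32_t>(b[0]) << 24) |
         (static_cast<uint32_t>(b[1]) << 16) |
         (static_cast<uint32_t>(b[2]) << 8) |
         static_cast<uint32_t>(b[3]);
}

// Reads the fields in the same order as the excerpt does for the image file.
ImageHeader read_image_header(std::istream& in) {
  ImageHeader h;
  h.magic = read_u32_be(in);
  h.num_items = read_u32_be(in);
  h.rows = read_u32_be(in);
  h.cols = read_u32_be(in);
  return h;
}
```

The label file can be handled the same way, except that its header only has the magic number and the item count.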

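Each image and its label are then read into the structure defined in protobuf, serialized to a string, and stored in the database. To reduce the overhead that every single write carries (integrity checks and the like), the writes are committed once every 1000 records.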

```
  for (int item_id = 0; item_id < num_items; ++item_id) {
    image_file.read(pixels, rows * cols);
    label_file.read(&label, 1);
    datum.set_data(pixels, rows*cols);
    datum.set_label(label);
    string key_str = caffe::format_int(item_id, 8);
    datum.SerializeToString(&value);

    // Put in db
    if (db_backend == "leveldb") {  // leveldb
      batch->Put(key_str, value);
...
    if (++count % 1000 == 0) {
      // Commit txn
      if (db_backend == "leveldb") {  // leveldb
        db->Write(leveldb::WriteOptions(), batch);
        delete batch;
        batch = new leveldb::WriteBatch();
      }
...
  if (count % 1000 != 0) {
    if (db_backend == "leveldb") {  // leveldb
      db->Write(leveldb::WriteOptions(), batch);
      delete batch;
      delete db;
    }
```
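
The batching pattern in this loop can be tried in isolation. The sketch below replaces the database with an in-memory map (`FakeDb`, a hypothetical stand-in, not the LevelDB API) and counts commits, so the effect of the final `count % 1000 != 0` flush is easy to see. `make_key` assumes that `caffe::format_int(item_id, 8)` zero-pads the id to eight digits, which keeps lexicographic key order equal to numeric order.

```cpp
#include <cassert>
#include <cstdio>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Zero-pads an id so that string ordering of keys matches numeric ordering.
std::string make_key(int id, int width) {
  assert(id >= 0 && width > 0 && width < 16);
  char buf[32];
  std::snprintf(buf, sizeof(buf), "%0*d", width, id);
  return std::string(buf);
}

using Batch = std::vector<std::pair<std::string, std::string>>;

// Hypothetical in-memory stand-in for a key-value store with batched writes.
struct FakeDb {
  std::map<std::string, std::string> rows;
  int commits = 0;
  void Write(const Batch& batch) {
    for (const auto& kv : batch) rows[kv.first] = kv.second;
    ++commits;
  }
};

// Buffers records, commits every batch_size items, then flushes the remainder.
void write_all(FakeDb& db, int num_items, int batch_size) {
  assert(batch_size > 0);
  Batch batch;
  int count = 0;
  for (int item_id = 0; item_id < num_items; ++item_id) {
    batch.emplace_back(make_key(item_id, 8), "value-" + std::to_string(item_id));
    if (++count % batch_size == 0) {
      db.Write(batch);
      batch.clear();
    }
  }
  if (count % batch_size != 0) {  // records left over from the last partial batch
    db.Write(batch);
  }
}
```

With 2500 records and a batch size of 1000, this commits three times: two full batches and one final flush of 500. Without the final check, those last 500 records would never reach the store.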
