Preface
This article walks through the internal data structures and flows of MultiSite, to deepen our understanding of how RadosGW works internally. There is already plenty of material online on how to set up MultiSite, so that part is not repeated here.
The zonegroup created for this article is xxxx, with two zones:
- master
- secondary
The zonegroup information is as follows:
```json
{
  "id": "9908295f-d8f5-4ac3-acd7-c955a177bd09",
  "name": "xxxx",
  "api_name": "",
  "is_master": "true",
  "endpoints": ["http:\/\/s3.246.com\/"],
  "hostnames": [],
  "hostnames_s3website": [],
  "master_zone": "8aa27332-01da-486a-994c-1ce527fa2fd7",
  "zones": [
    {
      "id": "484742ba-f8b7-4681-8411-af96ac778150",
      "name": "secondary",
      "endpoints": ["http:\/\/s3.243.com\/"],
      "log_meta": "false",
      "log_data": "true",
      "bucket_index_max_shards": 0,
      "read_only": "false"
    },
    {
      "id": "8aa27332-01da-486a-994c-1ce527fa2fd7",
      "name": "master",
      "endpoints": ["http:\/\/s3.246.com\/"],
      "log_meta": "false",
      "log_data": "true",
      "bucket_index_max_shards": 0,
      "read_only": "false"
    }
  ],
  "placement_targets": [
    {
      "name": "default-placement",
      "tags": []
    }
  ],
  "default_placement": "default-placement",
  "realm_id": "0c4b59a1-e1e7-4367-9b65-af238a2f145b"
}
```
Related pools
Data pool and index pool
First and foremost is the data pool, the final resting place of the object data uploaded by users:
```
root@NODE-246:/var/log/ceph# radosgw-admin zone get
{
  "id": "8aa27332-01da-486a-994c-1ce527fa2fd7",
  "name": "master",
  "domain_root": "default.rgw.data.root",
  "control_pool": "default.rgw.control",
  "gc_pool": "default.rgw.gc",
  "log_pool": "default.rgw.log",
  "intent_log_pool": "default.rgw.intent-log",
  "usage_log_pool": "default.rgw.usage",
  "user_keys_pool": "default.rgw.users.keys",
  "user_email_pool": "default.rgw.users.email",
  "user_swift_pool": "default.rgw.users.swift",
  "user_uid_pool": "default.rgw.users.uid",
  "system_key": {
    "access_key": "B9494C9XE7L7N50E9K2V",
    "secret_key": "O8e3IYV0gxHOwy61Og5ep4f7vQWPPFPhqRXjJrYT"
  },
  "placement_pools": [
    {
      "key": "default-placement",
      "val": {
        "index_pool": "default.rgw.buckets.index",
        "data_pool": "default.rgw.buckets.data",
        "data_extra_pool": "default.rgw.buckets.non-ec",
        "index_type": 0
      }
    }
  ],
  "metadata_heap": "",
  "realm_id": "0c4b59a1-e1e7-4367-9b65-af238a2f145b"
}
```
From the output above, the master zone's default-placement uses:
Role | Pool name |
---|---|
data pool | default.rgw.buckets.data |
index pool | default.rgw.buckets.index |
data extra pool | default.rgw.buckets.non-ec |
The test version is Jewel, which does not yet support dynamic index resharding; we set index max shards to 8, i.e., each bucket has 8 index shards:
```
rgw_override_bucket_index_max_shards = 8
```
The following command shows the bucket information of the current cluster:
```
root@NODE-246:/var/log/ceph# radosgw-admin bucket stats
[
  {
    "bucket": "segtest2",
    "pool": "default.rgw.buckets.data",
    "index_pool": "default.rgw.buckets.index",
    "id": "8aa27332-01da-486a-994c-1ce527fa2fd7.4641.769",
    "marker": "8aa27332-01da-486a-994c-1ce527fa2fd7.4641.769",
    "owner": "segs3account",
    ...
  },
  {
    "bucket": "segtest1",
    "pool": "default.rgw.buckets.data",
    "index_pool": "default.rgw.buckets.index",
    "id": "8aa27332-01da-486a-994c-1ce527fa2fd7.4641.768",
    "marker": "8aa27332-01da-486a-994c-1ce527fa2fd7.4641.768",
    "owner": "segs3account",
    ...
  }
]
```
As shown above, there are two buckets, with the following bucket ids:
bucket name | bucket id |
---|---|
segtest1 | 8aa27332-01da-486a-994c-1ce527fa2fd7.4641.768 |
segtest2 | 8aa27332-01da-486a-994c-1ce527fa2fd7.4641.769 |
Each bucket has 8 index shards, so there are 16 index objects in total:
```
root@NODE-246:/var/log/ceph# rados -p default.rgw.buckets.index ls
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.769.7
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.769.0
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.768.2
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.768.5
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.768.6
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.769.3
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.768.4
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.768.3
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.768.0
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.768.1
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.769.6
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.769.5
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.769.2
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.769.1
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.769.4
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.768.7
```
default.rgw.log pool
The log pool records various kinds of logs. For the MultiSite use case, we can find objects named like this in the default.rgw.log pool:
```
root@NODE-246:~# rados -p default.rgw.log ls | grep data_log
data_log.0
data_log.11
data_log.12
data_log.8
data_log.14
data_log.13
data_log.10
data_log.9
data_log.7
```
In general there are at most rgw_data_log_num_shards objects with this naming pattern; in our setup:
```cpp
OPTION(rgw_data_log_num_shards, OPT_INT, 128)
```
In rgw_bucket.h we can see the following code:
```cpp
num_shards = cct->_conf->rgw_data_log_num_shards;
oids = new string[num_shards];

string prefix = cct->_conf->rgw_data_log_obj_prefix;
if (prefix.empty()) {
  prefix = "data_log";
}
for (int i = 0; i < num_shards; i++) {
  char buf[16];
  snprintf(buf, sizeof(buf), "%s.%d", prefix.c_str(), i);
  oids[i] = buf;
}
renew_thread = new ChangesRenewThread(cct, this);
renew_thread->create("rgw_dt_lg_renew");
```
In general the content of these objects is empty; the useful information is all recorded in their omap:
```
root@NODE-246:~# rados -p default.rgw.log stat data_log.61
default.rgw.log/data_log.61 mtime 2018-12-10 14:39:38.000000, size 0
root@NODE-246:~# rados -p default.rgw.log listomapkeys data_log.61
1_1544421980.298394_2914.1
1_1544422002.458109_2939.1
...
1_1544423969.748641_4486.1
1_1544423978.090683_4495.1
1_1544424000.286801_4507.1
```
Writing an object
Overview
At a high level, uploading an object to a bucket, with both bi log and data log enabled, writes to several places:

- default.rgw.buckets.data: the actual object data is written to this pool, generally as one or more newly created RADOS objects
- default.rgw.buckets.index: once the data write completes, an entry for the object is added to the omap of the bucket index shard it belongs to
- a bi log entry is added to the omap of that same bucket index shard object
- a data log entry is added to the omap of the corresponding data_log object in the default.rgw.log pool
bi log
After the object upload finishes, listing the bucket index shard's omap shows the following:
```
root@node247:/var/log/ceph# rados -p default.rgw.buckets.index listomapkeys .dir.19cbf250-bb3e-4b8c-b5bf-1a40da6610fe.15083.1.7
oem.tar.bz2
0_00000000001.1.2
0_00000000002.2.3
```
The key oem.tar.bz2 corresponds to the object we uploaded and needs no further comment; besides it there are two additional keys, 0_00000000001.1.2 and 0_00000000002.2.3.
```
key (18 bytes):
00000000  80 30 5f 30 30 30 30 30  30 30 30 30 30 31 2e 31  |.0_00000000001.1|
00000010  2e 32                                             |.2|
00000012

value (133 bytes) :
00000000  03 01 7f 00 00 00 0f 00  00 00 30 30 30 30 30 30  |..........000000|
00000010  30 30 30 30 31 2e 31 2e  32 0b 00 00 00 6f 65 6d  |00001.1.2....oem|
00000020  2e 74 61 72 2e 62 7a 32  00 00 00 00 00 00 00 00  |.tar.bz2........|
00000030  01 01 0a 00 00 00 88 ff  ff ff ff ff ff ff ff 00  |................|
00000040  30 00 00 00 31 39 63 62  66 32 35 30 2d 62 62 33  |0...19cbf250-bb3|
00000050  65 2d 34 62 38 63 2d 62  35 62 66 2d 31 61 34 30  |e-4b8c-b5bf-1a40|
00000060  64 61 36 36 31 30 66 65  2e 31 35 30 38 33 2e 36  |da6610fe.15083.6|
00000070  34 32 31 30 00 00 01 00  00 00 00 00 00 00 00 00  |4210............|
00000080  00 00 00 00 00                                    |.....|
00000085

key (18 bytes):
00000000  80 30 5f 30 30 30 30 30  30 30 30 30 30 32 2e 32  |.0_00000000002.2|
00000010  2e 33                                             |.3|
00000012

value (125 bytes) :
00000000  03 01 77 00 00 00 0f 00  00 00 30 30 30 30 30 30  |..w.......000000|
00000010  30 30 30 30 32 2e 32 2e  33 0b 00 00 00 6f 65 6d  |00002.2.3....oem|
00000020  2e 74 61 72 2e 62 7a 32  e7 b2 14 5c 20 8e a4 04  |.tar.bz2...\ ...|
00000030  01 01 02 00 00 00 03 01  30 00 00 00 31 39 63 62  |........0...19cb|
00000040  66 32 35 30 2d 62 62 33  65 2d 34 62 38 63 2d 62  |f250-bb3e-4b8c-b|
00000050  35 62 66 2d 31 61 34 30  64 61 36 36 31 30 66 65  |5bf-1a40da6610fe|
00000060  2e 31 35 30 38 33 2e 36  34 32 31 30 00 01 02 00  |.15083.64210....|
00000070  00 00 00 00 00 00 00 00  00 00 00 00 00           |.............|
0000007d
```
Why does a PUT leave these two extra key-value pairs in the omap of .dir.19cbf250-bb3e-4b8c-b5bf-1a40da6610fe.15083.1.7, and what are they for?
Let's turn on debug-objclass on all OSDs and take a look:
```
ceph tell osd.\* injectargs --debug-objclass 20
```
The logs then show the following:
```
ceph-client.radosgw.0 log:
--------------------------------------
2018-12-15 15:53:11.079498 7f45723c7700 10 moving default.rgw.data.root+.bucket.meta.bucket_0:19cbf250-bb3e-4b8c-b5bf-1a40da6610fe.15083.1 to cache LRU end
2018-12-15 15:53:11.079530 7f45723c7700 20 bucket index object: .dir.19cbf250-bb3e-4b8c-b5bf-1a40da6610fe.15083.1.7
2018-12-15 15:53:11.083307 7f45723c7700 20 RGWDataChangesLog::add_entry() bucket.name=bucket_0 shard_id=7 now=2018-12-15 15:53:11.0.083306s cur_expiration=1970-01-01 08:00:00.000000s
2018-12-15 15:53:11.083351 7f45723c7700 20 RGWDataChangesLog::add_entry() sending update with now=2018-12-15 15:53:11.0.083306s cur_expiration=2018-12-15 15:53:41.0.083306s
2018-12-15 15:53:11.085002 7f45723c7700  2 req 64210:0.012000:s3:PUT /bucket_0/oem.tar.bz2:put_obj:completing
2018-12-15 15:53:11.085140 7f45723c7700  2 req 64210:0.012139:s3:PUT /bucket_0/oem.tar.bz2:put_obj:op status=0
2018-12-15 15:53:11.085148 7f45723c7700  2 req 64210:0.012147:s3:PUT /bucket_0/oem.tar.bz2:put_obj:http status=200
2018-12-15 15:53:11.085159 7f45723c7700  1 ====== req done req=0x7f45723c1750 op status=0 http_status=200 ======

ceph-osd.0.log
----------------
2018-12-15 15:53:11.080017 7faaa6fb3700  1 <cls> cls/rgw/cls_rgw.cc:689: rgw_bucket_prepare_op(): request: op=0 name=oem.tar.bz2 instance= tag=19cbf250-bb3e-4b8c-b5bf-1a40da6610fe.15083.64210
2018-12-15 15:53:11.083526 7faaa6fb3700  1 <cls> cls/rgw/cls_rgw.cc:830: rgw_bucket_complete_op(): request: op=0 name=oem.tar.bz2 instance= ver=3:1 tag=19cbf250-bb3e-4b8c-b5bf-1a40da6610fe.15083.64210
2018-12-15 15:53:11.083592 7faaa6fb3700  1 <cls> cls/rgw/cls_rgw.cc:753: read_index_entry(): existing entry: ver=-1:0 name=oem.tar.bz2 instance= locator=
2018-12-15 15:53:11.083639 7faaa6fb3700 20 <cls> cls/rgw/cls_rgw.cc:949: rgw_bucket_complete_op(): remove_objs.size()=0
2018-12-15 15:53:12.142564 7faaa6fb3700 20 <cls> cls/rgw/cls_rgw.cc:470: start_key=oem.tar.bz2 len=11
2018-12-15 15:53:12.142584 7faaa6fb3700 20 <cls> cls/rgw/cls_rgw.cc:487: got entry oem.tar.bz2[] m.size()=0
2018-12-15 15:53:12.170787 7faaa6fb3700 20 <cls> cls/rgw/cls_rgw.cc:470: start_key=oem.tar.bz2 len=11
2018-12-15 15:53:12.170799 7faaa6fb3700 20 <cls> cls/rgw/cls_rgw.cc:487: got entry oem.tar.bz2[] m.size()=0
2018-12-15 15:53:12.194152 7faaa6fb3700 20 <cls> cls/rgw/cls_rgw.cc:470: start_key=oem.tar.bz2 len=11
2018-12-15 15:53:12.194167 7faaa6fb3700 20 <cls> cls/rgw/cls_rgw.cc:487: got entry oem.tar.bz2[] m.size()=1
2018-12-15 15:53:12.256510 7faaa6fb3700 10 <cls> cls/rgw/cls_rgw.cc:2591: bi_log_iterate_range
2018-12-15 15:53:12.256523 7faaa6fb3700  0 <cls> cls/rgw/cls_rgw.cc:2621: bi_log_iterate_entries start_key=<80>0_00000000002.2.3 end_key=<80>1000_
```
The logs above show three key steps:
- rgw_bucket_prepare_op
- rgw_bucket_complete_op
- RGWDataChangesLog::add_entry()
At the end of void RGWPutObj::execute(), processor->complete is called:
```cpp
op_ret = processor->complete(etag, &mtime, real_time(), attrs,
                             (delete_at ? *delete_at : real_time()), if_match, if_nomatch,
                             (user_data.empty() ? nullptr : &user_data));
```
complete in turn calls do_complete, so let's look at do_complete directly:
```cpp
int RGWPutObjProcessor_Atomic::do_complete(string& etag, real_time *mtime, real_time set_mtime,
                                           map<string, bufferlist>& attrs, real_time delete_at,
                                           const char *if_match, const char *if_nomatch,
                                           const string *user_data)
{
  // wait for all in-flight async writes of this rgw object to finish
  int r = complete_writing_data();
  if (r < 0)
    return r;

  // mark the object as an atomic object
  obj_ctx.set_atomic(head_obj);

  // the rgw object's attrs will be written into the head object's xattrs
  RGWRados::Object op_target(store, bucket_info, obj_ctx, head_obj);

  /* some object types shouldn't be versioned, e.g., multipart parts */
  op_target.set_versioning_disabled(!versioned_object);

  RGWRados::Object::Write obj_op(&op_target);
  obj_op.meta.data = &first_chunk;
  obj_op.meta.manifest = &manifest;
  obj_op.meta.ptag = &unique_tag; /* use req_id as operation tag */
  obj_op.meta.if_match = if_match;
  obj_op.meta.if_nomatch = if_nomatch;
  obj_op.meta.mtime = mtime;
  obj_op.meta.set_mtime = set_mtime;
  obj_op.meta.owner = bucket_info.owner;
  obj_op.meta.flags = PUT_OBJ_CREATE;
  obj_op.meta.olh_epoch = olh_epoch;
  obj_op.meta.delete_at = delete_at;
  obj_op.meta.user_data = user_data;

  /* write_meta is a compound operation and the focus of the analysis below */
  r = obj_op.write_meta(obj_len, attrs);
  if (r < 0) {
    return r;
  }
  canceled = obj_op.meta.canceled;
  return 0;
}
```
To find out what 0_00000000001.1.2 and 0_00000000002.2.3 in the bucket index shard's omap really are, we need to step into write_meta:
```cpp
int RGWRados::Object::Write::write_meta(uint64_t size, map<string, bufferlist>& attrs)
{
  int r = 0;
  RGWRados *store = target->get_store();
  if ((r = this->_write_meta(store, size, attrs, true)) == -ENOTSUP) {
    ldout(store->ctx(), 0) << "WARNING: " << __func__
                           << "(): got ENOSUP, retry w/o store pg ver" << dendl;
    r = this->_write_meta(store, size, attrs, false);
  }
  return r;
}

int RGWRados::Object::Write::_write_meta(RGWRados *store, uint64_t size,
                                         map<string, bufferlist>& attrs, bool store_pg_ver)
{
  ...
  r = index_op.prepare(CLS_RGW_OP_ADD);
  if (r < 0)
    return r;

  r = ref.ioctx.operate(ref.oid, &op);
  if (r < 0) { /* we can expect to get -ECANCELED if object was replaced under,
                  or -ENOENT if was removed, or -EEXIST if it did not exist
                  before and now it does */
    goto done_cancel;
  }

  epoch = ref.ioctx.get_last_version();
  poolid = ref.ioctx.get_id();

  r = target->complete_atomic_modification();
  if (r < 0) {
    ldout(store->ctx(), 0) << "ERROR: complete_atomic_modification returned r=" << r << dendl;
  }

  r = index_op.complete(poolid, epoch, size, meta.set_mtime, etag, content_type,
                        &acl_bl, meta.category, meta.remove_objs, meta.user_data);
  ...
}
```
RGWRados::Bucket::UpdateIndex::prepare
The index_op.prepare operation is where the key-value pair 0_00000000001.1.2 gets written into the bucket index shard.
```cpp
int RGWRados::Bucket::UpdateIndex::prepare(RGWModifyOp op)
{
  if (blind) {
    return 0;
  }
  RGWRados *store = target->get_store();
  BucketShard *bs;

  int ret = get_bucket_shard(&bs);
  if (ret < 0) {
    ldout(store->ctx(), 5) << "failed to get BucketShard object: ret=" << ret << dendl;
    return ret;
  }

  if (obj_state && obj_state->write_tag.length()) {
    optag = string(obj_state->write_tag.c_str(), obj_state->write_tag.length());
  } else {
    if (optag.empty()) {
      append_rand_alpha(store->ctx(), optag, optag, 32);
    }
  }

  ret = store->cls_obj_prepare_op(*bs, op, optag, obj, bilog_flags);
  return ret;
}

int RGWRados::cls_obj_prepare_op(BucketShard& bs, RGWModifyOp op, string& tag,
                                 rgw_obj& obj, uint16_t bilog_flags)
{
  ObjectWriteOperation o;
  cls_rgw_obj_key key(obj.get_index_key_name(), obj.get_instance());
  cls_rgw_bucket_prepare_op(o, op, tag, key, obj.get_loc(), get_zone().log_data, bilog_flags);
  int flags = librados::OPERATION_FULL_TRY;
  int r = bs.index_ctx.operate(bs.bucket_obj, &o, flags);
  return r;
}

void cls_rgw_bucket_prepare_op(ObjectWriteOperation& o, RGWModifyOp op, string& tag,
                               const cls_rgw_obj_key& key, const string& locator,
                               bool log_op, uint16_t bilog_flags)
{
  struct rgw_cls_obj_prepare_op call;
  call.op = op;
  call.tag = tag;
  call.key = key;
  call.locator = locator;
  call.log_op = log_op;
  call.bilog_flags = bilog_flags;
  bufferlist in;
  ::encode(call, in);
  o.exec("rgw", "bucket_prepare_op", in);
}

cls/rgw/cls_rgw.cc
----------------------
void __cls_init()
{
  ...
  cls_register_cxx_method(h_class, "bucket_prepare_op", CLS_METHOD_RD | CLS_METHOD_WR,
                          rgw_bucket_prepare_op, &h_rgw_bucket_prepare_op);
  ...
}

int rgw_bucket_prepare_op(cls_method_context_t hctx, bufferlist *in, bufferlist *out)
{
  ...
  CLS_LOG(1, "rgw_bucket_prepare_op(): request: op=%d name=%s instance=%s tag=%s\n",
          op.op, op.key.name.c_str(), op.key.instance.c_str(), op.tag.c_str());
  ...
  // fill in proper state
  struct rgw_bucket_pending_info info;
  info.timestamp = real_clock::now();
  info.state = CLS_RGW_STATE_PENDING_MODIFY;
  info.op = op.op;
  entry.pending_map.insert(pair<string, rgw_bucket_pending_info>(op.tag, info));

  struct rgw_bucket_dir_header header;
  rc = read_bucket_header(hctx, &header);
  if (rc < 0) {
    CLS_LOG(1, "ERROR: rgw_bucket_complete_op(): failed to read header\n");
    return rc;
  }

  if (op.log_op) {
    // this is what produces the key 0_00000000001.1.2
    rc = log_index_operation(hctx, op.key, op.op, op.tag, entry.meta.mtime,
                             entry.ver, info.state, header.ver, header.max_marker,
                             op.bilog_flags, NULL, NULL);
    if (rc < 0)
      return rc;
  }

  // write out new key to disk
  bufferlist info_bl;
  ::encode(entry, info_bl);
  rc = cls_cxx_map_set_val(hctx, idx, &info_bl);
  if (rc < 0)
    return rc;

  return write_bucket_header(hctx, &header);
}
```
Note the log_index_operation function above: our first key, 0_00000000001.1.2, is produced by it.
```cpp
static void bi_log_prefix(string& key)
{
  key = BI_PREFIX_CHAR;
  key.append(bucket_index_prefixes[BI_BUCKET_LOG_INDEX]);
}

static void bi_log_index_key(cls_method_context_t hctx, string& key, string& id, uint64_t index_ver)
{
  bi_log_prefix(key);
  get_index_ver_key(hctx, index_ver, &id);
  key.append(id);
}

#define BI_PREFIX_CHAR 0x80

#define BI_BUCKET_OBJS_INDEX          0
#define BI_BUCKET_LOG_INDEX           1
#define BI_BUCKET_OBJ_INSTANCE_INDEX  2
#define BI_BUCKET_OLH_DATA_INDEX      3
#define BI_BUCKET_LAST_INDEX          4

static string bucket_index_prefixes[] = { "",      /* special handling for the objs list index */
                                          "0_",    /* bucket log index */
                                          "1000_", /* obj instance index */
                                          "1001_", /* olh data index */
                                          /* this must be the last index */
                                          "9999_", };
```
As we can see, these bi log keys all start with the byte 0x80, followed by '0_':
```
key (18 bytes):
00000000  80 30 5f 30 30 30 30 30  30 30 30 30 30 32 2e 32  |.0_00000000002.2|
00000010  2e 33                                             |.3|
00000012
```
We can inspect the corresponding bilog entry with radosgw-admin bilog list:
```json
{
  "op_id": "7#00000000001.1.2",
  "op_tag": "19cbf250-bb3e-4b8c-b5bf-1a40da6610fe.15083.64210",
  "op": "write",
  "object": "oem.tar.bz2",
  "instance": "",
  "state": "pending",
  "index_ver": 1,
  "timestamp": "0.000000",
  "ver": {
    "pool": -1,
    "epoch": 0
  },
  "bilog_flags": 0,
  "versioned": false,
  "owner": "",
  "owner_display_name": ""
},
```
RGWRados::Bucket::UpdateIndex::complete
With the prepare phase of UpdateIndex covered, we move on to the complete phase.
```cpp
int RGWRados::Bucket::UpdateIndex::complete(int64_t poolid, uint64_t epoch, uint64_t size,
                                            ceph::real_time& ut, string& etag, string& content_type,
                                            bufferlist *acl_bl, RGWObjCategory category,
                                            list<rgw_obj_key> *remove_objs, const string *user_data)
```
At the end of this function:
```cpp
ret = store->cls_obj_complete_add(*bs, optag, poolid, epoch, ent, category, remove_objs, bilog_flags);

int r = store->data_log->add_entry(bs->bucket, bs->shard_id);
if (r < 0) {
  lderr(store->ctx()) << "ERROR: failed writing data log" << dendl;
}
return ret;
```
The call chain of store->cls_obj_complete_add is:
```cpp
int RGWRados::cls_obj_complete_add(BucketShard& bs, string& tag, int64_t pool, uint64_t epoch,
                                   RGWObjEnt& ent, RGWObjCategory category,
                                   list<rgw_obj_key> *remove_objs, uint16_t bilog_flags)
{
  return cls_obj_complete_op(bs, CLS_RGW_OP_ADD, tag, pool, epoch, ent, category,
                             remove_objs, bilog_flags);
}

int RGWRados::cls_obj_complete_op(BucketShard& bs, RGWModifyOp op, string& tag, int64_t pool,
                                  uint64_t epoch, RGWObjEnt& ent, RGWObjCategory category,
                                  list<rgw_obj_key> *remove_objs, uint16_t bilog_flags)
{
  ...
  cls_rgw_bucket_complete_op(o, op, tag, ver, key, dir_meta, pro, get_zone().log_data, bilog_flags);
  ...
}

void cls_rgw_bucket_complete_op(ObjectWriteOperation& o, RGWModifyOp op, string& tag,
                                rgw_bucket_entry_ver& ver, const cls_rgw_obj_key& key,
                                rgw_bucket_dir_entry_meta& dir_meta,
                                list<cls_rgw_obj_key> *remove_objs, bool log_op, uint16_t bilog_flags)
{
  bufferlist in;
  struct rgw_cls_obj_complete_op call;
  call.op = op;
  call.tag = tag;
  call.key = key;
  call.ver = ver;
  call.meta = dir_meta;
  call.log_op = log_op;
  call.bilog_flags = bilog_flags;
  if (remove_objs)
    call.remove_objs = *remove_objs;
  ::encode(call, in);
  o.exec("rgw", "bucket_complete_op", in);
}

cls/rgw/cls_rgw.cc
-------------------
int rgw_bucket_complete_op(cls_method_context_t hctx, bufferlist *in, bufferlist *out)
{
  ...
  case CLS_RGW_OP_ADD:
    {
      struct rgw_bucket_dir_entry_meta& meta = op.meta;
      struct rgw_bucket_category_stats& stats = header.stats[meta.category];
      entry.meta = meta;
      entry.key = op.key;
      entry.exists = true;
      entry.tag = op.tag;
      stats.num_entries++;
      stats.total_size += meta.accounted_size;
      stats.total_size_rounded += cls_rgw_get_rounded_size(meta.accounted_size);
      bufferlist new_key_bl;
      ::encode(entry, new_key_bl);
      int ret = cls_cxx_map_set_val(hctx, idx, &new_key_bl);
      if (ret < 0)
        return ret;
    }
    break;
  }

  if (op.log_op) {
    rc = log_index_operation(hctx, op.key, op.op, op.tag, entry.meta.mtime,
                             entry.ver, CLS_RGW_STATE_COMPLETE, header.ver,
                             header.max_marker, op.bilog_flags, NULL, NULL);
    if (rc < 0)
      return rc;
  }
}
```
Here again we see log_index_operation; this call produces the second bi log key:
```
key (18 bytes):
00000000  80 30 5f 30 30 30 30 30  30 30 30 30 30 32 2e 32  |.0_00000000002.2|
00000010  2e 33                                             |.3|
00000012
```
Again, we can view the bilog entry via radosgw-admin:
```
radosgw-admin bilog list --bucket bucket_0
{
  "op_id": "7#00000000002.2.3",
  "op_tag": "19cbf250-bb3e-4b8c-b5bf-1a40da6610fe.15083.64210",
  "op": "write",
  "object": "oem.tar.bz2",
  "instance": "",
  "state": "complete",
  "index_ver": 2,
  "timestamp": "2018-12-15 07:53:11.077893152Z",
  "ver": {
    "pool": 3,
    "epoch": 1
  },
  "bilog_flags": 0,
  "versioned": false,
  "owner": "",
  "owner_display_name": ""
},
```
At this point both bi log entries written during an object upload have been covered. The numeric part of the keys deserves attention:
```cpp
static void bi_log_index_key(cls_method_context_t hctx, string& key, string& id, uint64_t index_ver)
{
  bi_log_prefix(key);
  get_index_ver_key(hctx, index_ver, &id);
  key.append(id);
}

static void get_index_ver_key(cls_method_context_t hctx, uint64_t index_ver, string *key)
{
  char buf[48];
  snprintf(buf, sizeof(buf), "%011llu.%llu.%d",
           (unsigned long long)index_ver,
           (unsigned long long)cls_current_version(hctx),
           cls_current_subop_num(hctx));
  *key = buf;
}

uint64_t cls_current_version(cls_method_context_t hctx)
{
  ReplicatedPG::OpContext *ctx = *(ReplicatedPG::OpContext **)hctx;
  return ctx->pg->info.last_user_version;
}

int cls_current_subop_num(cls_method_context_t hctx)
{
  ReplicatedPG::OpContext *ctx = *(ReplicatedPG::OpContext **)hctx;
  return ctx->current_osd_subop_num;
}
```
Ceph guarantees that this sequence portion of the key is monotonically increasing. That monotonicity is important for multisite incremental sync.
data_log
The UpdateIndex::complete function contains the following:
```cpp
ret = store->cls_obj_complete_add(*bs, optag, poolid, epoch, ent, category, remove_objs, bilog_flags);

int r = store->data_log->add_entry(bs->bucket, bs->shard_id);
if (r < 0) {
  lderr(store->ctx()) << "ERROR: failed writing data log" << dendl;
}
return ret;
```
store->data_log->add_entry is the call that appends a data log entry to the corresponding data_log object in default.rgw.log.
Whenever an object operation is performed on a bucket, a new omap key prefixed with "1_" is created on that data_log object.
Mapping buckets to data_log.X
In our Jewel deployment each bucket has 8 index shards, while default.rgw.log holds rgw_data_log_num_shards (128) data_log.X objects. RGW maps between the two as follows:
```cpp
int RGWDataChangesLog::choose_oid(const rgw_bucket_shard& bs) {
  const string& name = bs.bucket.name;
  int shard_shift = (bs.shard_id > 0 ? bs.shard_id : 0);
  uint32_t r = (ceph_str_hash_linux(name.c_str(), name.size()) + shard_shift) % num_shards;
  return (int)r;
}
```
Suppose we have N buckets, each with 8 shards; choose_oid then maps those 8*N bucket shards onto the 128 data_log.X objects.
After uploading an object, a new key-value pair appears in the omap of one of the data_log.X objects in default.rgw.log:
```
root@NODE-246:/var/log# rados -p default.rgw.log ls | grep data_log | xargs -I {} rados -p default.rgw.log listomapvals {}
1_1544942616.469385_1491.1
value (185 bytes) :
00000000  02 01 b3 00 00 00 00 00  00 00 37 00 00 00 62 75  |..........7...bu|
00000010  63 6b 65 74 5f 30 3a 31  39 63 62 66 32 35 30 2d  |cket_0:19cbf250-|
00000020  62 62 33 65 2d 34 62 38  63 2d 62 35 62 66 2d 31  |bb3e-4b8c-b5bf-1|
00000030  61 34 30 64 61 36 36 31  30 66 65 2e 31 35 30 38  |a40da6610fe.1508|
00000040  33 2e 31 3a 37 18 f4 15  5c 44 40 fa 1b 4a 00 00  |3.1:7...\D@..J..|
00000050  00 01 01 44 00 00 00 01  37 00 00 00 62 75 63 6b  |...D....7...buck|
00000060  65 74 5f 30 3a 31 39 63  62 66 32 35 30 2d 62 62  |et_0:19cbf250-bb|
00000070  33 65 2d 34 62 38 63 2d  62 35 62 66 2d 31 61 34  |3e-4b8c-b5bf-1a4|
00000080  30 64 61 36 36 31 30 66  65 2e 31 35 30 38 33 2e  |0da6610fe.15083.|
00000090  31 3a 37 18 f4 15 5c 44  40 fa 1b 1a 00 00 00 31  |1:7...\D@......1|
000000a0  5f 31 35 34 34 39 34 32  36 31 36 2e 34 36 39 33  |_1544942616.4693|
000000b0  38 35 5f 31 34 39 31 2e  31                       |85_1491.1|
000000b9
```
What is the naming convention for this key?
```cpp
cls/log/cls_log.cc
-----------------------
static string log_index_prefix = "1_";

static void get_index(cls_method_context_t hctx, utime_t& ts, string& index)
{
  get_index_time_prefix(ts, index);
  string unique_id;
  cls_cxx_subop_version(hctx, &unique_id);
  index.append(unique_id);
}

static void get_index_time_prefix(utime_t& ts, string& index)
{
  char buf[32];
  snprintf(buf, sizeof(buf), "%010ld.%06ld_", (long)ts.sec(), (long)ts.usec());
  index = log_index_prefix + buf;
}

uint64_t cls_current_version(cls_method_context_t hctx)
{
  ReplicatedPG::OpContext *ctx = *(ReplicatedPG::OpContext **)hctx;
  return ctx->pg->info.last_user_version;
}

int cls_current_subop_num(cls_method_context_t hctx)
{
  ReplicatedPG::OpContext *ctx = *(ReplicatedPG::OpContext **)hctx;
  return ctx->current_osd_subop_num;
}

void cls_cxx_subop_version(cls_method_context_t hctx, string *s)
{
  if (!s)
    return;
  char buf[32];
  uint64_t ver = cls_current_version(hctx);
  int subop_num = cls_current_subop_num(hctx);
  snprintf(buf, sizeof(buf), "%lld.%d", (long long)ver, subop_num);
  *s = buf;
}
```
The key 1_1544942616.469385_1491.1 follows the same pattern, and Ceph likewise guarantees it is monotonically increasing. That property matters during multisite sync.
That's all for this article; hopefully it is of some help for your study or work.