Preface

This article walks through the internal data structures and workflows of MultiSite, to deepen the understanding of RadosGW internals. There is already plenty of material online about how to set up MultiSite, so that part is not repeated here.

The zonegroup created for this article is named xxxx and contains two zones:

- master
- secondary

The zonegroup information is as follows:
```
{
    "id": "9908295f-d8f5-4ac3-acd7-c955a177bd09",
    "name": "xxxx",
    "api_name": "",
    "is_master": "true",
    "endpoints": [
        "http:\/\/s3.246.com\/"
    ],
    "hostnames": [],
    "hostnames_s3website": [],
    "master_zone": "8aa27332-01da-486a-994c-1ce527fa2fd7",
    "zones": [
        {
            "id": "484742ba-f8b7-4681-8411-af96ac778150",
            "name": "secondary",
            "endpoints": [
                "http:\/\/s3.243.com\/"
            ],
            "log_meta": "false",
            "log_data": "true",
            "bucket_index_max_shards": 0,
            "read_only": "false"
        },
        {
            "id": "8aa27332-01da-486a-994c-1ce527fa2fd7",
            "name": "master",
            "endpoints": [
                "http:\/\/s3.246.com\/"
            ],
            "log_meta": "false",
            "log_data": "true",
            "bucket_index_max_shards": 0,
            "read_only": "false"
        }
    ],
    "placement_targets": [
        {
            "name": "default-placement",
            "tags": []
        }
    ],
    "default_placement": "default-placement",
    "realm_id": "0c4b59a1-e1e7-4367-9b65-af238a2f145b"
}
```
Related pools

Data pool and index pool

The first pool to look at is the data pool, which is where the object data uploaded by users ultimately lives:
```
root@NODE-246:/var/log/ceph# radosgw-admin zone get
{
    "id": "8aa27332-01da-486a-994c-1ce527fa2fd7",
    "name": "master",
    "domain_root": "default.rgw.data.root",
    "control_pool": "default.rgw.control",
    "gc_pool": "default.rgw.gc",
    "log_pool": "default.rgw.log",
    "intent_log_pool": "default.rgw.intent-log",
    "usage_log_pool": "default.rgw.usage",
    "user_keys_pool": "default.rgw.users.keys",
    "user_email_pool": "default.rgw.users.email",
    "user_swift_pool": "default.rgw.users.swift",
    "user_uid_pool": "default.rgw.users.uid",
    "system_key": {
        "access_key": "B9494C9XE7L7N50E9K2V",
        "secret_key": "O8e3IYV0gxHOwy61Og5ep4f7vQWPPFPhqRXjJrYT"
    },
    "placement_pools": [
        {
            "key": "default-placement",
            "val": {
                "index_pool": "default.rgw.buckets.index",
                "data_pool": "default.rgw.buckets.data",
                "data_extra_pool": "default.rgw.buckets.non-ec",
                "index_type": 0
            }
        }
    ],
    "metadata_heap": "",
    "realm_id": "0c4b59a1-e1e7-4367-9b65-af238a2f145b"
}
```
From the output above, the default-placement target of the master zone uses the following pools:

Role | Pool name
---|---
data pool | default.rgw.buckets.data
index pool | default.rgw.buckets.index
data extra pool | default.rgw.buckets.non-ec
The version under test is Jewel, which does not yet support dynamic index resharding, so we set index max shards to 8, i.e. each bucket has 8 index shards:

```
rgw_override_bucket_index_max_shards = 8
```
The bucket information of the current cluster can be listed with the following command:
```
root@NODE-246:/var/log/ceph# radosgw-admin bucket stats
[
    {
        "bucket": "segtest2",
        "pool": "default.rgw.buckets.data",
        "index_pool": "default.rgw.buckets.index",
        "id": "8aa27332-01da-486a-994c-1ce527fa2fd7.4641.769",
        "marker": "8aa27332-01da-486a-994c-1ce527fa2fd7.4641.769",
        "owner": "segs3account",
        ...
    },
    {
        "bucket": "segtest1",
        "pool": "default.rgw.buckets.data",
        "index_pool": "default.rgw.buckets.index",
        "id": "8aa27332-01da-486a-994c-1ce527fa2fd7.4641.768",
        "marker": "8aa27332-01da-486a-994c-1ce527fa2fd7.4641.768",
        "owner": "segs3account",
        ...
    }
]
```
As the output shows, there are two buckets, with the following bucket ids:

Bucket name | Bucket id
---|---
segtest1 | 8aa27332-01da-486a-994c-1ce527fa2fd7.4641.768
segtest2 | 8aa27332-01da-486a-994c-1ce527fa2fd7.4641.769
Each bucket has 8 index shards, so there are 16 bucket index objects in total:
```
root@NODE-246:/var/log/ceph# rados -p default.rgw.buckets.index ls
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.769.7
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.769.0
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.768.2
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.768.5
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.768.6
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.769.3
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.768.4
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.768.3
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.768.0
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.768.1
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.769.6
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.769.5
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.769.2
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.769.1
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.769.4
.dir.8aa27332-01da-486a-994c-1ce527fa2fd7.4641.768.7
```
default.rgw.log pool
The log pool stores various kinds of logs. For the MultiSite use case, we can find objects named like this in the default.rgw.log pool:
```
root@NODE-246:~# rados -p default.rgw.log ls |grep data_log
data_log.0
data_log.11
data_log.12
data_log.8
data_log.14
data_log.13
data_log.10
data_log.9
data_log.7
```
In general there are at most rgw_data_log_num_shards objects with this naming style; in our case:

```
OPTION(rgw_data_log_num_shards, OPT_INT, 128)
```
The following code can be found in rgw_bucket.h:
```cpp
num_shards = cct->_conf->rgw_data_log_num_shards;

oids = new string[num_shards];

string prefix = cct->_conf->rgw_data_log_obj_prefix;
if (prefix.empty()) {
  prefix = "data_log";
}

for (int i = 0; i < num_shards; i++) {
  char buf[16];
  snprintf(buf, sizeof(buf), "%s.%d", prefix.c_str(), i);
  oids[i] = buf;
}

renew_thread = new ChangesRenewThread(cct, this);
renew_thread->create("rgw_dt_lg_renew");
```
These objects are generally empty; all the useful information is recorded in their omap:
```
root@NODE-246:~# rados -p default.rgw.log stat data_log.61
default.rgw.log/data_log.61 mtime 2018-12-10 14:39:38.000000, size 0
root@NODE-246:~# rados -p default.rgw.log listomapkeys data_log.61
1_1544421980.298394_2914.1
1_1544422002.458109_2939.1
...
1_1544423969.748641_4486.1
1_1544423978.090683_4495.1
1_1544424000.286801_4507.1
```
Writing an object

Overview

At a high level, uploading an object into a bucket writes to several places, assuming both bi log and data log are enabled:

- default.rgw.buckets.data: the actual object data is written to this pool; generally a new `_` object is created
- default.rgw.buckets.index: once the data write completes, an entry for the object is added to the omap of the corresponding bucket index shard
- a bi log entry is added to the omap of that bucket index object
- a data log entry is added to the omap of a data_log object in the default.rgw.log pool
bi log
After the object upload completes, listing the omap keys of the bucket index shard shows the following:
```
root@node247:/var/log/ceph# rados -p default.rgw.buckets.index listomapkeys .dir.19cbf250-bb3e-4b8c-b5bf-1a40da6610fe.15083.1.7
oem.tar.bz2
0_00000000001.1.2
0_00000000002.2.3
```
The key oem.tar.bz2 corresponds to the object we uploaded, so we can skip it; besides it there are two more keys, 0_00000000001.1.2 and 0_00000000002.2.3.
```
key (18 bytes):
00000000  80 30 5f 30 30 30 30 30 30 30 30 30 30 31 2e 31  |.0_00000000001.1|
00000010  2e 32                                             |.2|
00000012

value (133 bytes) :
00000000  03 01 7f 00 00 00 0f 00 00 00 30 30 30 30 30 30  |..........000000|
00000010  30 30 30 30 31 2e 31 2e 32 0b 00 00 00 6f 65 6d  |00001.1.2....oem|
00000020  2e 74 61 72 2e 62 7a 32 00 00 00 00 00 00 00 00  |.tar.bz2........|
00000030  01 01 0a 00 00 00 88 ff ff ff ff ff ff ff ff 00  |................|
00000040  30 00 00 00 31 39 63 62 66 32 35 30 2d 62 62 33  |0...19cbf250-bb3|
00000050  65 2d 34 62 38 63 2d 62 35 62 66 2d 31 61 34 30  |e-4b8c-b5bf-1a40|
00000060  64 61 36 36 31 30 66 65 2e 31 35 30 38 33 2e 36  |da6610fe.15083.6|
00000070  34 32 31 30 00 00 01 00 00 00 00 00 00 00 00 00  |4210............|
00000080  00 00 00 00 00                                   |.....|
00000085

key (18 bytes):
00000000  80 30 5f 30 30 30 30 30 30 30 30 30 30 32 2e 32  |.0_00000000002.2|
00000010  2e 33                                             |.3|
00000012

value (125 bytes) :
00000000  03 01 77 00 00 00 0f 00 00 00 30 30 30 30 30 30  |..w.......000000|
00000010  30 30 30 30 32 2e 32 2e 33 0b 00 00 00 6f 65 6d  |00002.2.3....oem|
00000020  2e 74 61 72 2e 62 7a 32 e7 b2 14 5c 20 8e a4 04  |.tar.bz2...\ ...|
00000030  01 01 02 00 00 00 03 01 30 00 00 00 31 39 63 62  |........0...19cb|
00000040  66 32 35 30 2d 62 62 33 65 2d 34 62 38 63 2d 62  |f250-bb3e-4b8c-b|
00000050  35 62 66 2d 31 61 34 30 64 61 36 36 31 30 66 65  |5bf-1a40da6610fe|
00000060  2e 31 35 30 38 33 2e 36 34 32 31 30 00 01 02 00  |.15083.64210....|
00000070  00 00 00 00 00 00 00 00 00 00 00 00 00           |.............|
0000007d
```
Why, after a PUT, are there two extra key-value pairs in the omap of .dir.19cbf250-bb3e-4b8c-b5bf-1a40da6610fe.15083.1.7, and what are they for?
Let us enable debug-objclass on all OSDs and take a look:

```
ceph tell osd.\* injectargs --debug-objclass 20
```

The following then shows up in the logs:
```
ceph-client.radosgw.0 log:
--------------------------------------
2018-12-15 15:53:11.079498 7f45723c7700 10 moving default.rgw.data.root+.bucket.meta.bucket_0:19cbf250-bb3e-4b8c-b5bf-1a40da6610fe.15083.1 to cache LRU end
2018-12-15 15:53:11.079530 7f45723c7700 20 bucket index object: .dir.19cbf250-bb3e-4b8c-b5bf-1a40da6610fe.15083.1.7
2018-12-15 15:53:11.083307 7f45723c7700 20 RGWDataChangesLog::add_entry() bucket.name=bucket_0 shard_id=7 now=2018-12-15 15:53:11.0.083306s cur_expiration=1970-01-01 08:00:00.000000s
2018-12-15 15:53:11.083351 7f45723c7700 20 RGWDataChangesLog::add_entry() sending update with now=2018-12-15 15:53:11.0.083306s cur_expiration=2018-12-15 15:53:41.0.083306s
2018-12-15 15:53:11.085002 7f45723c7700  2 req 64210:0.012000:s3:PUT /bucket_0/oem.tar.bz2:put_obj:completing
2018-12-15 15:53:11.085140 7f45723c7700  2 req 64210:0.012139:s3:PUT /bucket_0/oem.tar.bz2:put_obj:op status=0
2018-12-15 15:53:11.085148 7f45723c7700  2 req 64210:0.012147:s3:PUT /bucket_0/oem.tar.bz2:put_obj:http status=200
2018-12-15 15:53:11.085159 7f45723c7700  1 ====== req done req=0x7f45723c1750 op status=0 http_status=200 ======

ceph-osd.0.log
----------------
2018-12-15 15:53:11.080017 7faaa6fb3700  1 <cls> cls/rgw/cls_rgw.cc:689: rgw_bucket_prepare_op(): request: op=0 name=oem.tar.bz2 instance= tag=19cbf250-bb3e-4b8c-b5bf-1a40da6610fe.15083.64210
2018-12-15 15:53:11.083526 7faaa6fb3700  1 <cls> cls/rgw/cls_rgw.cc:830: rgw_bucket_complete_op(): request: op=0 name=oem.tar.bz2 instance= ver=3:1 tag=19cbf250-bb3e-4b8c-b5bf-1a40da6610fe.15083.64210
2018-12-15 15:53:11.083592 7faaa6fb3700  1 <cls> cls/rgw/cls_rgw.cc:753: read_index_entry(): existing entry: ver=-1:0 name=oem.tar.bz2 instance= locator=
2018-12-15 15:53:11.083639 7faaa6fb3700 20 <cls> cls/rgw/cls_rgw.cc:949: rgw_bucket_complete_op(): remove_objs.size()=0
2018-12-15 15:53:12.142564 7faaa6fb3700 20 <cls> cls/rgw/cls_rgw.cc:470: start_key=oem.tar.bz2 len=11
2018-12-15 15:53:12.142584 7faaa6fb3700 20 <cls> cls/rgw/cls_rgw.cc:487: got entry oem.tar.bz2[] m.size()=0
2018-12-15 15:53:12.170787 7faaa6fb3700 20 <cls> cls/rgw/cls_rgw.cc:470: start_key=oem.tar.bz2 len=11
2018-12-15 15:53:12.170799 7faaa6fb3700 20 <cls> cls/rgw/cls_rgw.cc:487: got entry oem.tar.bz2[] m.size()=0
2018-12-15 15:53:12.194152 7faaa6fb3700 20 <cls> cls/rgw/cls_rgw.cc:470: start_key=oem.tar.bz2 len=11
2018-12-15 15:53:12.194167 7faaa6fb3700 20 <cls> cls/rgw/cls_rgw.cc:487: got entry oem.tar.bz2[] m.size()=1
2018-12-15 15:53:12.256510 7faaa6fb3700 10 <cls> cls/rgw/cls_rgw.cc:2591: bi_log_iterate_range
2018-12-15 15:53:12.256523 7faaa6fb3700  0 <cls> cls/rgw/cls_rgw.cc:2621: bi_log_iterate_entries start_key=<80>0_00000000002.2.3 end_key=<80>1000_
```
From the logs above, the following calls are involved:

- rgw_bucket_prepare_op
- rgw_bucket_complete_op
- RGWDataChangesLog::add_entry()
At the end of void RGWPutObj::execute(), processor->complete is called:

```cpp
op_ret = processor->complete(etag, &mtime, real_time(), attrs,
                             (delete_at ? *delete_at : real_time()),
                             if_match, if_nomatch,
                             (user_data.empty() ? nullptr : &user_data));
```
complete in turn calls do_complete, so we can go straight to do_complete:

```cpp
int RGWPutObjProcessor_Atomic::do_complete(string& etag, real_time *mtime,
                                           real_time set_mtime,
                                           map<string, bufferlist>& attrs,
                                           real_time delete_at,
                                           const char *if_match,
                                           const char *if_nomatch,
                                           const string *user_data)
{
  // wait for all in-flight async writes of this rgw object to finish
  int r = complete_writing_data();
  if (r < 0)
    return r;

  // mark the object as an Atomic-type object
  obj_ctx.set_atomic(head_obj);

  // the rgw object's attrs are written into the head object's xattrs
  RGWRados::Object op_target(store, bucket_info, obj_ctx, head_obj);

  /* some object types shouldn't be versioned, e.g., multipart parts */
  op_target.set_versioning_disabled(!versioned_object);

  RGWRados::Object::Write obj_op(&op_target);
  obj_op.meta.data = &first_chunk;
  obj_op.meta.manifest = &manifest;
  obj_op.meta.ptag = &unique_tag; /* use req_id as operation tag */
  obj_op.meta.if_match = if_match;
  obj_op.meta.if_nomatch = if_nomatch;
  obj_op.meta.mtime = mtime;
  obj_op.meta.set_mtime = set_mtime;
  obj_op.meta.owner = bucket_info.owner;
  obj_op.meta.flags = PUT_OBJ_CREATE;
  obj_op.meta.olh_epoch = olh_epoch;
  obj_op.meta.delete_at = delete_at;
  obj_op.meta.user_data = user_data;

  /* write_meta is a compound operation and the focus of the analysis below */
  r = obj_op.write_meta(obj_len, attrs);
  if (r < 0) {
    return r;
  }

  canceled = obj_op.meta.canceled;
  return 0;
}
```
To find out what the 0_00000000001.1.2 and 0_00000000002.2.3 keys in the bucket index shard's omap really are, we need to look into write_meta:

```cpp
int RGWRados::Object::Write::write_meta(uint64_t size, map<string, bufferlist>& attrs)
{
  int r = 0;
  RGWRados *store = target->get_store();
  if ((r = this->_write_meta(store, size, attrs, true)) == -ENOTSUP) {
    ldout(store->ctx(), 0) << "WARNING: " << __func__
                           << "(): got ENOSUP, retry w/o store pg ver" << dendl;
    r = this->_write_meta(store, size, attrs, false);
  }
  return r;
}

int RGWRados::Object::Write::_write_meta(RGWRados *store, uint64_t size,
                                         map<string, bufferlist>& attrs,
                                         bool store_pg_ver)
{
  ...
  r = index_op.prepare(CLS_RGW_OP_ADD);
  if (r < 0)
    return r;

  r = ref.ioctx.operate(ref.oid, &op);
  if (r < 0) { /* we can expect to get -ECANCELED if object was replaced under,
                  or -ENOENT if was removed, or -EEXIST if it did not exist
                  before and now it does */
    goto done_cancel;
  }

  epoch = ref.ioctx.get_last_version();
  poolid = ref.ioctx.get_id();

  r = target->complete_atomic_modification();
  if (r < 0) {
    ldout(store->ctx(), 0) << "ERROR: complete_atomic_modification returned r=" << r << dendl;
  }

  r = index_op.complete(poolid, epoch, size, meta.set_mtime, etag, content_type,
                        &acl_bl, meta.category, meta.remove_objs, meta.user_data);
  ...
}
```
RGWRados::Bucket::UpdateIndex::prepare
The 0_00000000001.1.2 key-value pair is written into the bucket index shard during the index_op.prepare operation.
```cpp
int RGWRados::Bucket::UpdateIndex::prepare(RGWModifyOp op)
{
  if (blind) {
    return 0;
  }
  RGWRados *store = target->get_store();
  BucketShard *bs;

  int ret = get_bucket_shard(&bs);
  if (ret < 0) {
    ldout(store->ctx(), 5) << "failed to get BucketShard object: ret=" << ret << dendl;
    return ret;
  }

  if (obj_state && obj_state->write_tag.length()) {
    optag = string(obj_state->write_tag.c_str(), obj_state->write_tag.length());
  } else {
    if (optag.empty()) {
      append_rand_alpha(store->ctx(), optag, optag, 32);
    }
  }

  ret = store->cls_obj_prepare_op(*bs, op, optag, obj, bilog_flags);
  return ret;
}

int RGWRados::cls_obj_prepare_op(BucketShard& bs, RGWModifyOp op, string& tag,
                                 rgw_obj& obj, uint16_t bilog_flags)
{
  ObjectWriteOperation o;
  cls_rgw_obj_key key(obj.get_index_key_name(), obj.get_instance());
  cls_rgw_bucket_prepare_op(o, op, tag, key, obj.get_loc(), get_zone().log_data, bilog_flags);
  int flags = librados::OPERATION_FULL_TRY;
  int r = bs.index_ctx.operate(bs.bucket_obj, &o, flags);
  return r;
}

void cls_rgw_bucket_prepare_op(ObjectWriteOperation& o, RGWModifyOp op, string& tag,
                               const cls_rgw_obj_key& key, const string& locator,
                               bool log_op, uint16_t bilog_flags)
{
  struct rgw_cls_obj_prepare_op call;
  call.op = op;
  call.tag = tag;
  call.key = key;
  call.locator = locator;
  call.log_op = log_op;
  call.bilog_flags = bilog_flags;
  bufferlist in;
  ::encode(call, in);
  o.exec("rgw", "bucket_prepare_op", in);
}

cls/rgw/cls_rgw.cc
----------------------
void __cls_init()
{
  ...
  cls_register_cxx_method(h_class, "bucket_prepare_op",
                          CLS_METHOD_RD | CLS_METHOD_WR,
                          rgw_bucket_prepare_op, &h_rgw_bucket_prepare_op);
  ...
}

int rgw_bucket_prepare_op(cls_method_context_t hctx, bufferlist *in, bufferlist *out)
{
  ...
  CLS_LOG(1, "rgw_bucket_prepare_op(): request: op=%d name=%s instance=%s tag=%s\n",
          op.op, op.key.name.c_str(), op.key.instance.c_str(), op.tag.c_str());
  ...
  // fill in proper state
  struct rgw_bucket_pending_info info;
  info.timestamp = real_clock::now();
  info.state = CLS_RGW_STATE_PENDING_MODIFY;
  info.op = op.op;
  entry.pending_map.insert(pair<string, rgw_bucket_pending_info>(op.tag, info));

  struct rgw_bucket_dir_header header;
  rc = read_bucket_header(hctx, &header);
  if (rc < 0) {
    CLS_LOG(1, "ERROR: rgw_bucket_complete_op(): failed to read header\n");
    return rc;
  }

  if (op.log_op) {
    // this is where 0_00000000001.1.2 is generated
    rc = log_index_operation(hctx, op.key, op.op, op.tag, entry.meta.mtime,
                             entry.ver, info.state, header.ver, header.max_marker,
                             op.bilog_flags, NULL, NULL);
    if (rc < 0)
      return rc;
  }

  // write out new key to disk
  bufferlist info_bl;
  ::encode(entry, info_bl);
  rc = cls_cxx_map_set_val(hctx, idx, &info_bl);
  if (rc < 0)
    return rc;

  return write_bucket_header(hctx, &header);
}
```
Note the log_index_operation function above: our first key, 0_00000000001.1.2, is produced by it.
```cpp
static void bi_log_prefix(string& key)
{
  key = BI_PREFIX_CHAR;
  key.append(bucket_index_prefixes[BI_BUCKET_LOG_INDEX]);
}

static void bi_log_index_key(cls_method_context_t hctx, string& key, string& id, uint64_t index_ver)
{
  bi_log_prefix(key);
  get_index_ver_key(hctx, index_ver, &id);
  key.append(id);
}

#define BI_PREFIX_CHAR 0x80

#define BI_BUCKET_OBJS_INDEX          0
#define BI_BUCKET_LOG_INDEX           1
#define BI_BUCKET_OBJ_INSTANCE_INDEX  2
#define BI_BUCKET_OLH_DATA_INDEX      3
#define BI_BUCKET_LAST_INDEX          4

static string bucket_index_prefixes[] = { "",      /* special handling for the objs list index */
                                          "0_",    /* bucket log index */
                                          "1000_", /* obj instance index */
                                          "1001_", /* olh data index */
                                          /* this must be the last index */
                                          "9999_", };
```
As we can see, every bi log key begins with the byte 0x80 followed by '0_':
```
key (18 bytes):
00000000  80 30 5f 30 30 30 30 30 30 30 30 30 30 32 2e 32  |.0_00000000002.2|
00000010  2e 33                                             |.3|
00000012
```
The corresponding bilog entry can be inspected with radosgw-admin bilog list:
```
{
    "op_id": "7#00000000001.1.2",
    "op_tag": "19cbf250-bb3e-4b8c-b5bf-1a40da6610fe.15083.64210",
    "op": "write",
    "object": "oem.tar.bz2",
    "instance": "",
    "state": "pending",
    "index_ver": 1,
    "timestamp": "0.000000",
    "ver": {
        "pool": -1,
        "epoch": 0
    },
    "bilog_flags": 0,
    "versioned": false,
    "owner": "",
    "owner_display_name": ""
},
```
RGWRados::Bucket::UpdateIndex::complete
Having covered the prepare phase of UpdateIndex, it is time to look at the complete phase:
```cpp
int RGWRados::Bucket::UpdateIndex::complete(int64_t poolid, uint64_t epoch,
                                            uint64_t size, ceph::real_time& ut,
                                            string& etag, string& content_type,
                                            bufferlist *acl_bl,
                                            RGWObjCategory category,
                                            list<rgw_obj_key> *remove_objs,
                                            const string *user_data)
```
At the end of this function:
```cpp
ret = store->cls_obj_complete_add(*bs, optag, poolid, epoch, ent, category,
                                  remove_objs, bilog_flags);

int r = store->data_log->add_entry(bs->bucket, bs->shard_id);
if (r < 0) {
  lderr(store->ctx()) << "ERROR: failed writing data log" << dendl;
}

return ret;
```
Following the store->cls_obj_complete_add call:
```cpp
int RGWRados::cls_obj_complete_add(BucketShard& bs, string& tag, int64_t pool,
                                   uint64_t epoch, RGWObjEnt& ent,
                                   RGWObjCategory category,
                                   list<rgw_obj_key> *remove_objs,
                                   uint16_t bilog_flags)
{
  return cls_obj_complete_op(bs, CLS_RGW_OP_ADD, tag, pool, epoch, ent,
                             category, remove_objs, bilog_flags);
}

int RGWRados::cls_obj_complete_op(BucketShard& bs, RGWModifyOp op, string& tag,
                                  int64_t pool, uint64_t epoch, RGWObjEnt& ent,
                                  RGWObjCategory category,
                                  list<rgw_obj_key> *remove_objs,
                                  uint16_t bilog_flags)
{
  ...
  cls_rgw_bucket_complete_op(o, op, tag, ver, key, dir_meta, pro,
                             get_zone().log_data, bilog_flags);
  ...
}

void cls_rgw_bucket_complete_op(ObjectWriteOperation& o, RGWModifyOp op, string& tag,
                                rgw_bucket_entry_ver& ver, const cls_rgw_obj_key& key,
                                rgw_bucket_dir_entry_meta& dir_meta,
                                list<cls_rgw_obj_key> *remove_objs, bool log_op,
                                uint16_t bilog_flags)
{
  bufferlist in;
  struct rgw_cls_obj_complete_op call;
  call.op = op;
  call.tag = tag;
  call.key = key;
  call.ver = ver;
  call.meta = dir_meta;
  call.log_op = log_op;
  call.bilog_flags = bilog_flags;
  if (remove_objs)
    call.remove_objs = *remove_objs;
  ::encode(call, in);
  o.exec("rgw", "bucket_complete_op", in);
}

cls/rgw/cls_rgw.cc
-------------------
int rgw_bucket_complete_op(cls_method_context_t hctx, bufferlist *in, bufferlist *out)
{
  ...
  case CLS_RGW_OP_ADD:
    {
      struct rgw_bucket_dir_entry_meta& meta = op.meta;
      struct rgw_bucket_category_stats& stats = header.stats[meta.category];
      entry.meta = meta;
      entry.key = op.key;
      entry.exists = true;
      entry.tag = op.tag;
      stats.num_entries++;
      stats.total_size += meta.accounted_size;
      stats.total_size_rounded += cls_rgw_get_rounded_size(meta.accounted_size);
      bufferlist new_key_bl;
      ::encode(entry, new_key_bl);
      int ret = cls_cxx_map_set_val(hctx, idx, &new_key_bl);
      if (ret < 0)
        return ret;
    }
    break;
  }

  if (op.log_op) {
    rc = log_index_operation(hctx, op.key, op.op, op.tag, entry.meta.mtime,
                             entry.ver, CLS_RGW_STATE_COMPLETE, header.ver,
                             header.max_marker, op.bilog_flags, NULL, NULL);
    if (rc < 0)
      return rc;
  }
}
```
Here again we see log_index_operation; this call is what writes the second bi log key:
```
key (18 bytes):
00000000  80 30 5f 30 30 30 30 30 30 30 30 30 30 32 2e 32  |.0_00000000002.2|
00000010  2e 33                                             |.3|
00000012
```
As before, the bilog can be inspected with radosgw-admin:
```
radosgw-admin bilog list --bucket bucket_0
{
    "op_id": "7#00000000002.2.3",
    "op_tag": "19cbf250-bb3e-4b8c-b5bf-1a40da6610fe.15083.64210",
    "op": "write",
    "object": "oem.tar.bz2",
    "instance": "",
    "state": "complete",
    "index_ver": 2,
    "timestamp": "2018-12-15 07:53:11.077893152Z",
    "ver": {
        "pool": 3,
        "epoch": 1
    },
    "bilog_flags": 0,
    "versioned": false,
    "owner": "",
    "owner_display_name": ""
},
```
That covers both bi log entries written during an object upload. It is worth paying attention to the numeric part of the keys:
```cpp
static void bi_log_index_key(cls_method_context_t hctx, string& key, string& id, uint64_t index_ver)
{
  bi_log_prefix(key);
  get_index_ver_key(hctx, index_ver, &id);
  key.append(id);
}

static void get_index_ver_key(cls_method_context_t hctx, uint64_t index_ver, string *key)
{
  char buf[48];
  snprintf(buf, sizeof(buf), "%011llu.%llu.%d",
           (unsigned long long)index_ver,
           (unsigned long long)cls_current_version(hctx),
           cls_current_subop_num(hctx));
  *key = buf;
}

uint64_t cls_current_version(cls_method_context_t hctx)
{
  ReplicatedPG::OpContext *ctx = *(ReplicatedPG::OpContext **)hctx;
  return ctx->pg->info.last_user_version;
}

int cls_current_subop_num(cls_method_context_t hctx)
{
  ReplicatedPG::OpContext *ctx = *(ReplicatedPG::OpContext **)hctx;
  return ctx->current_osd_subop_num;
}
```
Ceph guarantees that this numeric suffix increases monotonically, which is important for MultiSite incremental sync.
data_log
Recall the following snippet from UpdateIndex::complete:
```cpp
ret = store->cls_obj_complete_add(*bs, optag, poolid, epoch, ent, category,
                                  remove_objs, bilog_flags);

int r = store->data_log->add_entry(bs->bucket, bs->shard_id);
if (r < 0) {
  lderr(store->ctx()) << "ERROR: failed writing data log" << dendl;
}

return ret;
```
The store->data_log->add_entry call is what appends a log record to the corresponding data_log object in default.rgw.log.
Whenever an object operation is performed on a bucket, a new omap key of the form "1_" + ... is created.
Mapping between buckets and data_log.X
In our Jewel deployment, each bucket has 8 index shards, while the number of data_log.X objects in default.rgw.log is rgw_data_log_num_shards, i.e. 128. RGW maps the former onto the latter as follows:
```cpp
int RGWDataChangesLog::choose_oid(const rgw_bucket_shard& bs)
{
  const string& name = bs.bucket.name;
  int shard_shift = (bs.shard_id > 0 ? bs.shard_id : 0);
  uint32_t r = (ceph_str_hash_linux(name.c_str(), name.size()) + shard_shift) % num_shards;
  return (int)r;
}
```
Suppose we have N buckets, each with 8 shards; choose_oid then maps those 8*N bucket shards onto the 128 data_log.X objects.
After uploading an object, a new key-value pair appears in the omap of one of the data_log.X objects in default.rgw.log:
```
root@NODE-246:/var/log# rados -p default.rgw.log ls |grep data_log |xargs -I {} rados -p default.rgw.log listomapvals {}
1_1544942616.469385_1491.1
value (185 bytes) :
00000000  02 01 b3 00 00 00 00 00 00 00 37 00 00 00 62 75  |..........7...bu|
00000010  63 6b 65 74 5f 30 3a 31 39 63 62 66 32 35 30 2d  |cket_0:19cbf250-|
00000020  62 62 33 65 2d 34 62 38 63 2d 62 35 62 66 2d 31  |bb3e-4b8c-b5bf-1|
00000030  61 34 30 64 61 36 36 31 30 66 65 2e 31 35 30 38  |a40da6610fe.1508|
00000040  33 2e 31 3a 37 18 f4 15 5c 44 40 fa 1b 4a 00 00  |3.1:7...\D@..J..|
00000050  00 01 01 44 00 00 00 01 37 00 00 00 62 75 63 6b  |...D....7...buck|
00000060  65 74 5f 30 3a 31 39 63 62 66 32 35 30 2d 62 62  |et_0:19cbf250-bb|
00000070  33 65 2d 34 62 38 63 2d 62 35 62 66 2d 31 61 34  |3e-4b8c-b5bf-1a4|
00000080  30 64 61 36 36 31 30 66 65 2e 31 35 30 38 33 2e  |0da6610fe.15083.|
00000090  31 3a 37 18 f4 15 5c 44 40 fa 1b 1a 00 00 00 31  |1:7...\D@......1|
000000a0  5f 31 35 34 34 39 34 32 36 31 36 2e 34 36 39 33  |_1544942616.4693|
000000b0  38 35 5f 31 34 39 31 2e 31                       |85_1491.1|
000000b9
```
What is the naming convention of this key?
```cpp
cls/log/cls_log.cc
-----------------------
static string log_index_prefix = "1_";

static void get_index(cls_method_context_t hctx, utime_t& ts, string& index)
{
  get_index_time_prefix(ts, index);
  string unique_id;
  cls_cxx_subop_version(hctx, &unique_id);
  index.append(unique_id);
}

static void get_index_time_prefix(utime_t& ts, string& index)
{
  char buf[32];
  snprintf(buf, sizeof(buf), "%010ld.%06ld_", (long)ts.sec(), (long)ts.usec());
  index = log_index_prefix + buf;
}

uint64_t cls_current_version(cls_method_context_t hctx)
{
  ReplicatedPG::OpContext *ctx = *(ReplicatedPG::OpContext **)hctx;
  return ctx->pg->info.last_user_version;
}

int cls_current_subop_num(cls_method_context_t hctx)
{
  ReplicatedPG::OpContext *ctx = *(ReplicatedPG::OpContext **)hctx;
  return ctx->current_osd_subop_num;
}

void cls_cxx_subop_version(cls_method_context_t hctx, string *s)
{
  if (!s)
    return;
  char buf[32];
  uint64_t ver = cls_current_version(hctx);
  int subop_num = cls_current_subop_num(hctx);
  snprintf(buf, sizeof(buf), "%lld.%d", (long long)ver, subop_num);
  *s = buf;
}
```
The key 1_1544942616.469385_1491.1 follows the same pattern: Ceph guarantees that it increases monotonically, a property that again matters for MultiSite synchronization.
That is all for this article; hopefully it is of some help for your study or work.