CephNotes

Some notes about Ceph
Laurent Barbe @CCM Benchmark

Get OMAP Key/value Size

List the total size of all omap keys and values for each object in a pool.

object        size_keys(kB)  size_values(kB)  total(kB)  nr_keys  nr_values
meta.log.44   0              1                1          0        10
data_log.78   0              56419            56419      0        406841
meta.log.36   0              1                1          0        10
data_log.71   0              56758            56758      0        409426
data_log.111  0              56519            56519      0        405909
...
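
One way to gather these numbers is to walk the objects of the pool with rados. Here is a minimal sketch for the key side only (the pool name is just an example, and value sizes would additionally require parsing the output of "rados listomapvals"):

pool=.log                                     # example pool name
for obj in $(rados -p "$pool" ls); do
    nr_keys=$(rados -p "$pool" listomapkeys "$obj" | wc -l)
    size_keys=$(rados -p "$pool" listomapkeys "$obj" | wc -c)   # sum of key name lengths (plus newlines)
    echo "$obj  nr_keys=$nr_keys  size_keys=${size_keys}B"
done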

Intel 520 SSD Journal

A quick check of my Intel 520 SSD journal, which has been running for two years in a small cluster.

smartctl -a /dev/sda
=== START OF INFORMATION SECTION ===
Model Family:     Intel 520 Series SSDs
Device Model:     INTEL SSDSC2CW060A3
Serial Number:    CVCV305200NB060AGN
LU WWN Device Id: 5 001517 8f36af9db
Firmware Version: 400i
User Capacity:    60 022 480 896 bytes [60,0 GB]
Sector Size:      512 bytes logical/physical

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours_and_Msec 0x0032   000   000   000    Old_age   Always       -       910315h+05m+29.420s
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       13
170 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       13
184 End-to-End_Error        0x0033   100   100   090    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x000f   117   117   050    Pre-fail  Always       -       153797776
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       13
225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       1367528
226 Workld_Media_Wear_Indic 0x0032   100   100   000    Old_age   Always       -       65535
227 Workld_Host_Reads_Perc  0x0032   100   100   000    Old_age   Always       -       3
228 Workload_Minutes        0x0032   100   100   000    Old_age   Always       -       65535
232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   093   093   000    Old_age   Always       -       0
241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       1367528
242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       56808
249 NAND_Writes_1GiB        0x0013   100   100   000    Pre-fail  Always       -       33624

9 – Power on hours count

The cluster has been running for two years.

170 – Available Reserved Space

100%

174 – Unexpected power loss

13 => due to power losses on the cluster. Everything has always restarted fine. :)

187 – Uncorrectable error count

The raw value looks alarming, but the normalized value (117) is still well above the threshold (50), so it is OK.

233 – Media Wearout Indicator

093 => it decreases progressively. I do not know if it is completely reliable, but it is usually a good indicator.

241 – Host Writes 32MiB

1367528 => about 42 TB written by the host. That corresponds to roughly 60 GB per day for 3 OSD journals, which seems normal.

249 – NAND Writes 1GiB

33624 => about 33 TB written to NAND, i.e. a write amplification of 0.79. That is pretty good.

The drive is 60.0 GB, which means each LBA has been written about 560 times.
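
For reference, the back-of-the-envelope math behind these figures, using the raw values from the SMART table above (attribute 241 counts 32 MiB units, attribute 249 counts 1 GiB units):

echo "1367528 * 32 / 1024 / 1024"    | bc -l   # host writes: ~41.7 TiB
echo "33624 / 1024"                  | bc -l   # NAND writes: ~32.8 TiB
echo "33624 / (1367528 * 32 / 1024)" | bc -l   # write amplification: ~0.79
echo "1367528 * 32 / 1024 / 730"     | bc -l   # ~58 GiB written per day over ~2 years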

For clusters with a little more load, the Intel DC S3700 models remain my favorite, but in my case the Intel 520 does its job very well.

RadosGW Big Index

$ rados -p .default.rgw.buckets.index listomapkeys .dir.default.1970130.1 | wc -l
166768275

With each key containing between 100 and 250 bytes, this makes a very big object for RADOS (several GB)… especially when migrating it from one OSD to another (this will lock all writes). Moreover, the OSD containing this object will use a lot of memory.

Since the Hammer release it is possible to shard the bucket index. You cannot reshard an existing bucket, but you can set it up for new buckets. This is a very good thing for scalability.
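
For new buckets, the number of index shards can be set in the gateway configuration, for example with rgw override bucket index max shards. A minimal sketch (the client section name and shard count are only examples):

# ceph.conf on the radosgw host; only affects buckets created afterwards
[client.radosgw.gateway]
rgw override bucket index max shards = 8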

OpenVZ: Kernel 3.10 With Rbd Module

The 3.x kernel for OpenVZ is out, and it is compiled with the rbd module:

root@debian:~# uname -a
Linux debian 3.10.0-3-pve #1 SMP Thu Jun 12 13:50:49 CEST 2014 x86_64 GNU/Linux

root@debian:~# modinfo rbd
filename:       /lib/modules/3.10.0-3-pve/kernel/drivers/block/rbd.ko
license:        GPL
author:         Jeff Garzik <jeff@garzik.org>
description:    rados block device
author:         Yehuda Sadeh <yehuda@hq.newdream.net>
author:         Sage Weil <sage@newdream.net>
srcversion:     F459625E3E9943C5880D8BE
depends:        libceph
intree:         Y
vermagic:       3.10.0-3-pve SMP mod_unload modversions 

There will be new things to test …
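
For example, a first quick test of the module could look like this (assuming ceph-common is installed and the node has access to a cluster; the image name is made up):

modprobe rbd
rbd create rbd/testimage --size 1024     # 1 GB test image in the default "rbd" pool
rbd map rbd/testimage                    # should show up as /dev/rbd0
rbd showmapped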

By default, no CephFS module.

The announcement of the source code publication: http://lists.openvz.org/pipermail/announce/2015-April/000579.html

Ceph Pool Migration

You have probably already been faced with migrating all objects from one pool to another, especially to change parameters that cannot be modified on an existing pool: for example, to migrate from a replicated pool to an EC pool, to change the EC profile, or to reduce the number of PGs… There are different methods, depending on the contents of the pool (RBD, objects) and its size…

The simple way

The simplest and safest method is to copy all objects with the “rados cppool” command. However, the pool needs to be read-only during the copy.

For example, to migrate to an EC pool:

pool=testpool
ceph osd pool create $pool.new 4096 4096 erasure default
rados cppool $pool $pool.new
ceph osd pool rename $pool $pool.old
ceph osd pool rename $pool.new $pool

But it does not work in all cases, for example with EC pools: “error copying pool testpool => newpool: (95) Operation not supported”.

Using Cache Tier

This must be used with caution; test it before using it on a production cluster. It worked for my needs, but I cannot say that it works in all cases.

I find this method interesting because it allows transparent operation, reduces downtime, and avoids duplicating all the data. The principle is simple: use cache tiering, but in reverse order.

At the beginning, we have two pools: the current “testpool”, and the new one, “newpool”.

Setup cache tier

Configure the existing pool as a cache pool:

ceph osd tier add newpool testpool --force-nonempty
ceph osd tier cache-mode testpool forward

In ceph osd dump you should see something like this:

--> pool 58 'testpool' replicated size 3 .... tier_of 80 

Now, all new objects will be created on the new pool.

Now we can force all objects to be flushed and evicted to the new pool:

rados -p testpool cache-flush-evict-all
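
You can follow the progress by watching the object count of the old pool go down, for example with:

rados df | grep testpool     # the objects column should eventually reach 0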

Switch all clients to the new pool

(You can also do this step earlier, for example just after the cache tier creation.) Until all the data has been flushed to the new pool, you need to set an overlay so that objects are also looked up in the old pool:

ceph osd tier set-overlay newpool testpool

In ceph osd dump you should see something like this:

--> pool 80 'newpool' replicated size 3 .... tiers 58 read_tier 58 write_tier 58

With the overlay set, all operations will be forwarded to the old testpool.

Now you can switch all the clients to access objects on the new pool.

Finish

When all the data has been migrated, you can remove the overlay and the old “cache” pool:

ceph osd tier remove-overlay newpool
ceph osd tier remove newpool testpool
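
Once you have verified that the old pool is empty and no longer referenced, it can be deleted (irreversible, hence the double confirmation):

ceph osd pool delete testpool testpool --yes-i-really-really-mean-it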

Get the Number of Placement Groups Per Osd

Get the PG distribution per OSD on the command line:

pool :  0   1   2   3   | SUM 
------------------------------------------------
osd.10  6   6   6   84  | 102
osd.11  7   6   6   76  | 95
osd.12  4   4   3   56  | 67
osd.20  5   5   5   107 | 122
osd.13  3   3   3   73  | 82
osd.21  9   10  10  110 | 139
osd.14  3   3   3   85  | 94
osd.15  6   6   6   87  | 105
osd.22  6   6   5   87  | 104
osd.23  10  10  10  87  | 117
osd.16  7   7   7   102 | 123
osd.17  5   5   5   99  | 114
osd.18  4   4   4   103 | 115
osd.19  7   7   7   112 | 133
osd.0   5   5   5   72  | 87
osd.1   5   5   6   83  | 99
osd.2   3   3   3   74  | 83
osd.3   5   5   5   61  | 76
osd.4   3   3   4   76  | 86
osd.5   5   5   5   78  | 93
osd.6   3   2   2   78  | 85
osd.7   3   3   3   88  | 97
osd.8   9   9   9   91  | 118
osd.9   5   6   6   79  | 96
------------------------------------------------
SUM :   128 128 128 2048    |
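
A minimal sketch of how such a count can be derived from ceph pg dump pgs_brief (the awk below assumes the acting set is the last bracketed field, which may vary between Ceph versions, and it prints one line per OSD/pool pair rather than the matrix above):

ceph pg dump pgs_brief 2>/dev/null |
awk '$1 ~ /^[0-9]+\.[0-9a-f]+$/ {
       split($1, id, "."); pool = id[1]
       acting = ""
       for (i = 2; i <= NF; i++)                  # keep the last bracketed set (acting)
           if ($i ~ /^\[.*\]$/) acting = $i
       gsub(/[][]/, "", acting)
       n = split(acting, osd, ",")
       for (i = 1; i <= n; i++) pg[osd[i] "," pool]++
     }
     END {
       for (k in pg) { split(k, v, ","); printf "osd.%-3s pool %-3s : %d\n", v[1], v[2], pg[k] }
     }' | sort -V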

CRUSHMAP: Example of a Hierarchical Cluster Map

It is not always easy to know how to organize your data in the Crushmap, especially when trying to distribute the data geographically while separating different types of disks, e.g. SATA, SAS, and SSD. Let’s see what we can imagine as a Crushmap hierarchy.

Let’s take a simple example of a distribution across two datacenters (Model 1.1).
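
As a starting point, such a hierarchy could be declared like this in the decompiled Crushmap (bucket names, ids and weights are purely illustrative, and the host buckets are assumed to be defined as usual above this point):

# Two datacenters under the default root
datacenter dc1 {
        id -10
        alg straw
        hash 0        # rjenkins1
        item host-a1 weight 2.000
        item host-a2 weight 2.000
}
datacenter dc2 {
        id -11
        alg straw
        hash 0        # rjenkins1
        item host-b1 weight 2.000
        item host-b2 weight 2.000
}
root default {
        id -1
        alg straw
        hash 0        # rjenkins1
        item dc1 weight 4.000
        item dc2 weight 4.000
}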