CephNotes

Some notes about Ceph
Laurent Barbe @CCM Benchmark

Check OSD Version

Occasionally it may be useful to check the OSD version across the entire cluster:

ceph tell osd.* version
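
To quickly spot a version mismatch between daemons, the output can be summarized. This is only a rough sketch, since it depends on the exact output format of ceph tell on your release:

ceph tell osd.* version 2>/dev/null | grep -o 'ceph version.*' | sort | uniq -c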

Find the OSD Location

Of course, the simplest way is to use the command ceph osd tree.

Note that if an OSD is down, you can see its “last address” in ceph health detail:

$ ceph health detail
...
osd.37 is down since epoch 16952, last address 172.16.4.68:6804/628

Also, you can use:

$ ceph osd find 37
{
    "osd": 37,
    "ip": "172.16.4.68:6804\/636",
    "crush_location": {
        "datacenter": "pa2.ssdr",
        "host": "lxc-ceph-main-front-osd-03.ssdr",
        "physical-host": "store-front-03.ssdr",
        "rack": "pa2-104.ssdr",
        "root": "ssdr"
    }
}

To get the partition UUID, you can use ceph osd dump (see the end of the line):

$ ceph osd dump | grep ^osd.37
osd.37 down out weight 0 up_from 56847 up_thru 57230 down_at 57538 last_clean_interval [56640,56844) 172.16.4.72:6801/16852 172.17.2.37:6801/16852 172.17.2.37:6804/16852 172.16.4.72:6804/16852 exists d7ab9ac1-c68c-4594-b25e-48d3a7cfd182

$ ssh 172.17.2.37 blkid | grep d7ab9ac1-c68c-4594-b25e-48d3a7cfd182
/dev/sdg1: UUID="98594f17-eae5-45f8-9e90-cd25a8f89442" TYPE="xfs" PARTLABEL="ceph data" PARTUUID="d7ab9ac1-c68c-4594-b25e-48d3a7cfd182"
#(Depending on how the partitions are created, the PARTUUID label is not necessarily present.)
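
Alternatively, ceph osd metadata can report the host and data paths directly, without ssh (a sketch; the exact fields returned depend on your Ceph release):

$ ceph osd metadata 37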

LXC 2.0.0 First Support for Ceph RBD

FYI, initial RBD support has been added to the LXC commands.

Example:

# Install LXC 2.0.0 (Ubuntu):
$ add-apt-repository ppa:ubuntu-lxc/lxc-stable
$ apt-get update
$ apt-get install lxc

# Add a ceph pool for LXC block devices:
$ ceph osd pool create lxc 64 64

# To create the container, you only need to specify the "rbd" backing store:
$ lxc-create -n ctn1 -B rbd -t debian
/dev/rbd0
debootstrap is /usr/sbin/debootstrap
Checking cache download in /var/cache/lxc/debian/rootfs-jessie-amd64 ...
Copying rootfs to /usr/lib/x86_64-linux-gnu/lxc...
Generation complete.

$ rbd showmapped
id pool image snap device
0  lxc  ctn1  -    /dev/rbd0

$ rbd -p lxc info ctn1
rbd image 'ctn1':
  size 1024 MB in 256 objects
  order 22 (4096 kB objects)
  block_name_prefix: rb.0.1217d.74b0dc51
  format: 1

$ lxc-start -n ctn1
$ lxc-attach -n ctn1
ctn1$ mount | grep ' / '
/dev/rbd/lxc/ctn1 on / type ext3 (rw,relatime,stripe=1024,data=ordered)

$ lxc-destroy -n ctn1
Removing image: 100% complete...done.
Destroyed container ctn1
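
If needed, the pool, image name and filesystem size can also be given at creation time. This is only a sketch based on the rbd backing store options documented in lxc-create(1); the container name, pool and size below are placeholders:

$ lxc-create -n ctn2 -B rbd --rbdpool lxc --rbdname ctn2 --fssize 2G -t debian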

Downgrade LSI 9207 to P19 Firmware

After numerous problems encountered with the P20 firmware on this card model, here are the steps I followed to flash it back to the P19 version.

Since then, no more problems. :)

The card is an LSI 9207-8i (SAS2308 controller) with IT firmware:

lspci | grep LSI
01:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)
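
To check which firmware version the controller is currently running, LSI's sas2flash utility can list it (a minimal check, assuming the utility is installed):

sas2flash -listall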

Get OMAP Key/value Size

List the total size of all OMAP keys and values for each object in a pool.

object        size_keys(kB)  size_values(kB)  total(kB)  nr_keys  nr_values
meta.log.44   0              1                1          0        10
data_log.78   0              56419            56419      0        406841
meta.log.36   0              1                1          0        10
data_log.71   0              56758            56758      0        409426
data_log.111  0              56519            56519      0        405909
...
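
A rough way to gather these numbers with rados (a sketch, not the original script: sizes are approximated with wc -c, the pool name is a placeholder, and the loop can be slow on large pools):

pool=".log"   # placeholder: pool containing the objects to inspect
for obj in $(rados -p "$pool" ls); do
    nr_keys=$(rados -p "$pool" listomapkeys "$obj" | wc -l)
    size_keys=$(rados -p "$pool" listomapkeys "$obj" | wc -c)
    size_values=$(rados -p "$pool" listomapvals "$obj" | wc -c)
    echo "$obj keys=$nr_keys keys_bytes=$size_keys values_bytes=$size_values"
done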

Intel 520 SSD Journal

A quick check of my Intel 520 SSD, which has been running for 2 years on a small cluster.

smartctl -a /dev/sda
=== START OF INFORMATION SECTION ===
Model Family:     Intel 520 Series SSDs
Device Model:     INTEL SSDSC2CW060A3
Serial Number:    CVCV305200NB060AGN
LU WWN Device Id: 5 001517 8f36af9db
Firmware Version: 400i
User Capacity:    60 022 480 896 bytes [60,0 GB]
Sector Size:      512 bytes logical/physical

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours_and_Msec 0x0032   000   000   000    Old_age   Always       -       910315h+05m+29.420s
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       13
170 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       13
184 End-to-End_Error        0x0033   100   100   090    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x000f   117   117   050    Pre-fail  Always       -       153797776
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       13
225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       1367528
226 Workld_Media_Wear_Indic 0x0032   100   100   000    Old_age   Always       -       65535
227 Workld_Host_Reads_Perc  0x0032   100   100   000    Old_age   Always       -       3
228 Workload_Minutes        0x0032   100   100   000    Old_age   Always       -       65535
232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   093   093   000    Old_age   Always       -       0
241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       1367528
242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       56808
249 NAND_Writes_1GiB        0x0013   100   100   000    Pre-fail  Always       -       33624

9 – Power on hours count

The cluster has been running for 2 years.

170 – Available_Reservd_Space

100% => the reserved space is still fully available.

174 – Unexpected power loss

13 => due to power losses on the cluster. Everything has always restarted fine. :)

187 – Uncorrectable error count

The raw value looks odd, but the normalized value (117) is still well above the threshold (50), so this is OK.

233 – Media Wearout Indicator

093 => decreases progressively. I do not know if it is completely reliable, but it is usually a good indicator.

241 – Host Writes 32MiB

1367528 => about 42 TB written by the host. This corresponds to roughly 60 GB per day for 3 OSDs, which seems normal.

249 – NAND Writes 1GiB

33624 => about 33 TB written to NAND, i.e. a write amplification of about 0.79. That is pretty good.

The drive is 60.0 GB, so each LBA has been written about 560 times.
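
A quick back-of-the-envelope check of these figures with plain shell arithmetic (GiB and GB are mixed loosely here, as in the notes above):

echo $((1367528 * 32 / 1024)) GiB written by host      # 42735 GiB, about 42 TiB
echo $((1367528 * 32 / 1024 / 730)) GiB per day        # about 58 GiB/day over ~2 years
echo "scale=3; 33624 / (1367528 * 32 / 1024)" | bc     # write amplification, about 0.79
echo $((33624 / 60)) full-drive writes                 # about 560 writes per LBA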

For clusters with a bit more load, the Intel DC S3700 models remain my favorite, but in my case the Intel 520s do their job very well.

RadosGW Big Index

$ rados -p .default.rgw.buckets.index listomapkeys .dir.default.1970130.1 | wc -l
166768275

With each key containing between 100 and 250 bytes, this makes a very big object for RADOS (several GB)… especially when migrating it from one OSD to another (this will lock all writes). Moreover, the OSD containing this object will use a lot of memory…

Since the Hammer release, it is possible to shard the bucket index. However, you cannot shard an existing bucket index; you can only enable sharding for new buckets. This is a very good thing for scalability.
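
For new buckets, the number of index shards can be set in ceph.conf. A minimal sketch, assuming the Hammer-era option name; the client section name is a placeholder, and the shard count should be sized for the expected number of objects:

[client.radosgw.gateway]
rgw override bucket index max shards = 8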

OpenVZ: Kernel 3.10 With Rbd Module

A 3.x kernel for OpenVZ is out, and it is compiled with the rbd module:

root@debian:~# uname -a
Linux debian 3.10.0-3-pve #1 SMP Thu Jun 12 13:50:49 CEST 2014 x86_64 GNU/Linux

root@debian:~# modinfo rbd
filename:       /lib/modules/3.10.0-3-pve/kernel/drivers/block/rbd.ko
license:        GPL
author:         Jeff Garzik <jeff@garzik.org>
description:    rados block device
author:         Yehuda Sadeh <yehuda@hq.newdream.net>
author:         Sage Weil <sage@newdream.net>
srcversion:     F459625E3E9943C5880D8BE
depends:        libceph
intree:         Y
vermagic:       3.10.0-3-pve SMP mod_unload modversions 

There will be new things to test …
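
For example, a first quick test could be to load the module and map an image (assuming a reachable cluster and an existing image "test" in the pool "rbd"):

modprobe rbd
rbd map rbd/test
rbd showmapped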

By default, no CephFS module.

The announcement of the source code publication: http://lists.openvz.org/pipermail/announce/2015-April/000579.html