CephNotes

Some notes about Ceph
Laurent Barbe @CCM Benchmark

Erasure Code on Small Clusters

Erasure coding is primarily designed for clusters of a sufficient size. However, if you want to use it with a small number of hosts, you can adapt the crushmap so the data distribution better matches your needs.

Here is a first example that distributes data with a fault tolerance of 1 host OR 2 drives, using k=4, m=2 on 3 hosts or more.

rule erasure_ruleset {
  ruleset X
  type erasure
  min_size 6
  max_size 6
  step take default
  step choose indep 3 type host   # select 3 distinct hosts
  step choose indep 2 type osd    # then 2 OSDs on each host: 3 x 2 = k+m = 6 chunks
  step emit
}
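
For completeness, here is a minimal sketch of how such a rule could be used, assuming a hypothetical profile name (myprofile) and pool name (ecpool), and that the ruleset id X has been replaced by a real id:

# Create a matching erasure-code profile (hypothetical name "myprofile")
ceph osd erasure-code-profile set myprofile k=4 m=2

# Create an erasure-coded pool using this profile and the rule above
# (hypothetical pool name "ecpool", 128 placement groups)
ceph osd pool create ecpool 128 128 erasure myprofile erasure_ruleset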

Crushmap for 2 DC

An example of a crushmap for replication across 2 datacenters:

rule replicated_ruleset {
  ruleset X
  type replicated
  min_size 2
  max_size 3
  step take default
  step choose firstn 2 type datacenter   # select 2 datacenters
  step chooseleaf firstn -1 type host    # in each one, pick up to (pool size - 1) hosts
  step emit
}

This works well with pool size=2 (not recommended!) or 3. If you set the pool size to more than 3 (and increase max_size in the crush rule), be careful: you will end up with n-1 replicas in one datacenter and only one in the other.

If you want to be able to write data even when one of the datacenters is unreachable, the pool min_size should be set to 1 even if size is set to 3. In this case, pay attention to the location of the monitors.
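
For reference, a minimal sketch of how a pool could be pointed at this rule and tuned as discussed above, assuming a hypothetical pool named mypool (pre-Luminous releases use the crush_ruleset option with the numeric id, Luminous and later use crush_rule with the rule name):

# Assign the rule to the pool (hypothetical pool name "mypool")
ceph osd pool set mypool crush_ruleset X
# ceph osd pool set mypool crush_rule replicated_ruleset   # Luminous and later

# Replication settings discussed above
ceph osd pool set mypool size 3
ceph osd pool set mypool min_size 1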

Change Log Level on the Fly to Ceph Daemons

Aaahhh, a full disk this morning. Sometimes the logs go crazy, and the files can quickly reach several gigabytes.

Show the debug options (on the OSD host):

# Look at the log file
tail -n 1000 /var/log/ceph/ceph-osd.33.log

# Check debug levels
ceph daemon osd.33 config show | grep '"debug_'
    "debug_none": "0\/5",
    "debug_lockdep": "0\/1",
    "debug_context": "0\/1",
    "debug_crush": "1\/1",
    "debug_mds": "1\/5",
    ...
    "debug_filestore": "1\/5",
    ...

In my case it was filestore, so “ceph tell” is my friend to apply the new value to the whole cluster (on the admin host):

ceph tell osd.* injectargs --debug-filestore 0/5

Now you can remove the log file and ask the daemon to reopen it:

rm /var/log/ceph/ceph-osd.33.log

ceph daemon osd.33 log reopen

Then it remains to be added to the ceph.conf file (on each OSD host):

[osd]
        debug filestore = 0/5
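
To check that the daemon has picked up the value (a quick sketch, reusing osd.33 from above):

ceph daemon osd.33 config get debug_filestore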

Main New Features in the Latest Versions of Ceph

It’s always pleasant to see how fast new features appear in Ceph. :)

Here is a non-exhaustive list of some of them in the latest releases:

Kraken (October 2016)

  • BlueStore declared as stable
  • AsyncMessenger
  • RGW: metadata indexing via Elasticsearch, index resharding, compression
  • S3 bucket lifecycle API, RGW export over NFS version 3 through Ganesha
  • RADOS supports overwrites on erasure-coded pools / RBD on erasure-coded pools (experimental)

Jewel (April 2016)

  • CephFS declared as stable
  • RGW multisite rearchitected (allows active/active configurations)
  • AWS4 compatibility
  • RBD mirroring
  • BlueStore (experimental)

Check OSD Version

Occasionally it may be useful to check the OSD versions across the entire cluster:

ceph tell osd.* version
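
On more recent releases (Luminous and later), the ceph versions command also gives a summary of the versions running on all daemons:

ceph versions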

Find the OSD Location

Of course, the simplest way is to use the ceph osd tree command.

Note that if an OSD is down, you can see its “last address” in ceph health detail:

$ ceph health detail
...
osd.37 is down since epoch 16952, last address 172.16.4.68:6804/628

Also, you can use:

$ ceph osd find 37
{
    "osd": 37,
    "ip": "172.16.4.68:6804\/636",
    "crush_location": {
        "datacenter": "pa2.ssdr",
        "host": "lxc-ceph-main-front-osd-03.ssdr",
        "physical-host": "store-front-03.ssdr",
        "rack": "pa2-104.ssdr",
        "root": "ssdr"
    }
}

To get the partition UUID, you can use ceph osd dump (see the end of the line):

$ ceph osd dump | grep ^osd.37
osd.37 down out weight 0 up_from 56847 up_thru 57230 down_at 57538 last_clean_interval [56640,56844) 172.16.4.72:6801/16852 172.17.2.37:6801/16852 172.17.2.37:6804/16852 172.16.4.72:6804/16852 exists d7ab9ac1-c68c-4594-b25e-48d3a7cfd182

$ ssh 172.17.2.37 blkid | grep d7ab9ac1-c68c-4594-b25e-48d3a7cfd182
/dev/sdg1: UUID="98594f17-eae5-45f8-9e90-cd25a8f89442" TYPE="xfs" PARTLABEL="ceph data" PARTUUID="d7ab9ac1-c68c-4594-b25e-48d3a7cfd182"
# (Depending on how the partitions were created, the PARTUUID label is not necessarily present.)
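
Alternatively, assuming udev has populated /dev/disk/by-partuuid on the OSD host, the device can be resolved directly from the partition UUID:

$ ssh 172.17.2.37 readlink -f /dev/disk/by-partuuid/d7ab9ac1-c68c-4594-b25e-48d3a7cfd182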

LXC 2.0.0 First Support for Ceph RBD

FYI, initial RBD support has been added to the LXC commands.

Example :

# Install LXC 2.0.0 (Ubuntu):
$ add-apt-repository ppa:ubuntu-lxc/lxc-stable
$ apt-get update
$ apt-get install lxc

# Add a ceph pool for LXC block devices:
$ ceph osd pool create lxc 64 64

# To create the container, you only need to specify the "rbd" backingstore:
$ lxc-create -n ctn1 -B rbd -t debian
/dev/rbd0
debootstrap is /usr/sbin/debootstrap
Checking cache download in /var/cache/lxc/debian/rootfs-jessie-amd64 ...
Copying rootfs to /usr/lib/x86_64-linux-gnu/lxc...
Generation complete.
$ rbd showmapped
id pool image snap device
0  lxc  ctn1  -    /dev/rbd0

$ rbd -p lxc info ctn1
rbd image 'ctn1':
  size 1024 MB in 256 objects
  order 22 (4096 kB objects)
  block_name_prefix: rb.0.1217d.74b0dc51
  format: 1
$ lxc-start -n ctn1
$ lxc-attach -n ctn1
ctn1$ mount | grep ' / '
/dev/rbd/lxc/ctn1 on / type ext3 (rw,relatime,stripe=1024,data=ordered)
$ lxc-destroy -n ctn1
Removing image: 100% complete...done.
Destroyed container ctn1

Downgrade LSI 9207 to P19 Firmware

After numerous problems encountered with the P20 firmware on this card model, here are the steps I followed to flash it back to the P19 version.

Since then, no more problems. :)

The card is an LSI 9207-8i (SAS2308 controller) with IT firmware:

lspci | grep LSI
01:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)

Get OMAP Key/value Size

List the total size of all OMAP keys and values for each object in a pool (see the sketch after the sample output below).

object        size_keys(kB)  size_values(kB)  total(kB)  nr_keys  nr_values
meta.log.44   0              1                1          0        10
data_log.78   0              56419            56419      0        406841
meta.log.36   0              1                1          0        10
data_log.71   0              56758            56758      0        409426
data_log.111  0              56519            56519      0        405909
...
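
As a rough sketch of how part of this information can be collected with the rados CLI (assumptions: a hypothetical pool name; only the number of keys and the size of the key names are computed here, since value sizes would require parsing the output of rados listomapvals):

#!/bin/bash
# Hypothetical pool name; replace with your own.
pool=default.rgw.log

# For each object, count the OMAP keys and sum the size of the key names
# (wc -c includes one newline per key, so this is an approximation).
for obj in $(rados -p "$pool" ls); do
    nr_keys=$(rados -p "$pool" listomapkeys "$obj" | wc -l)
    keys_bytes=$(rados -p "$pool" listomapkeys "$obj" | wc -c)
    echo "$obj nr_keys=$nr_keys keys_bytes=$keys_bytes"
done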