Dealing with some osd timeouts

Fri 03 March 2017

Certain operations occasionally take the OSD longer to process than expected. The operation may then fail, or even cause the OSD to suicide (abort itself). Many parameters control these timeouts. Some examples:

Thread suicide timed out

heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1ee3ca7700' had suicide timed out after 150
common/HeartbeatMap …
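When reading such a log line, the two useful pieces are the thread pool name and the timeout value. A minimal sketch of how they could be pulled out follows. The helper `parse_suicide_line` is hypothetical (not part of Ceph), and the sample line is the one quoted above.

```shell
# Hypothetical helper (not part of Ceph): extract the thread pool name and
# the timeout in seconds from a heartbeat_map "suicide timed out" log line.
parse_suicide_line() {
    sed -n "s/.*'\(.*\) thread .*' had suicide timed out after \([0-9]*\).*/\1 \2/p"
}

line="heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1ee3ca7700' had suicide timed out after 150"
printf '%s\n' "$line" | parse_suicide_line
# prints: OSD::osd_op_tp 150

# On a live OSD host, the configured suicide timeouts could then be listed with
# (osd.33 is only an example ID):
#   ceph daemon osd.33 config show | grep suicide
```

The pool name hints at which timeout parameter is involved; the `grep suicide` line is left as a comment because it needs a running OSD and its admin socket.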

Erasure code on small clusters

Fri 27 January 2017

Erasure coding is designed for clusters of sufficient size. However, if you want to use it with a small number of hosts, you can adapt the crushmap so that the data distribution better matches your needs.

Here is a first example for distributing data with 1 host OR 2 …

Crushmap for 2 DC

Mon 23 January 2017

Update 2021: Since the Pacific release, there is a specific operating mode for the monitors in the case of a stretched cluster. See: https://docs.ceph.com/en/latest/rados/operations/stretch-mode/ For more information, check out Greg's talk at FOSDEM.

An example crushmap for replication across 2 datacenters with 2 or …

Change the log level of Ceph daemons on the fly

Fri 20 January 2017

Aaahhh, full disk this morning. Sometimes the logs can go crazy, and the files can quickly reach several gigabytes.

Show debug options (on host):

# Look at log file
tail -n 1000 /var/log/ceph/ceph-osd.33.log

# Check debug levels
ceph daemon osd.33 config show | grep '"debug_'
    "debug_none …
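Once the noisy subsystem is identified, its level can be lowered at runtime. This is a minimal sketch, assuming the same `osd.33` as above and assuming `debug_osd` is the culprit with `0/5` as the target level (both are illustrative choices). The wrapper `set_debug` and its `DRY_RUN` switch are hypothetical conveniences; the underlying `ceph tell ... injectargs` call is the standard way to change an option on a running daemon.

```shell
# Hypothetical wrapper: change one debug level on a running OSD.
# Usage: set_debug OSD_ID SUBSYSTEM LEVEL
# With DRY_RUN=1 the command is only printed, not executed.
set_debug() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "ceph tell osd.$1 injectargs '--debug-$2 $3'"
    else
        ceph tell "osd.$1" injectargs "--debug-$2 $3"
    fi
}

# Example values only: OSD 33, subsystem "osd", level 0/5.
DRY_RUN=1 set_debug 33 osd 0/5
# prints: ceph tell osd.33 injectargs '--debug-osd 0/5'
```

A value set this way is not persistent: it lasts until the daemon restarts, so a permanent change still belongs in the configuration.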

Main new features in the latest versions of ceph

Fri 06 January 2017

It's always pleasant to see how fast new features appear in Ceph. :)

Here is a non-exhaustive list of some of them in the latest releases:

Kraken (October 2016)

BlueStore declared as stable
AsyncMessenger
RGW: metadata indexing via Elasticsearch, index resharding, compression
S3 bucket lifecycle API, RGW Export NFS version 3 …

CephNotes

Some notes about Ceph
Laurent Barbe @Adelius / INRAE