CephNotes

Some notes about Ceph
Laurent Barbe @Adelius / INRAE

Dealing with some osd timeouts

Some operations occasionally take longer than expected for an OSD to process. When that happens, the operation may fail, or the OSD may even abort itself ("suicide"). Several parameters control these timeouts. Some examples:

Thread suicide timed out

heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1ee3ca7700' had suicide timed out after 150
common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f1f0c2a3700 time 2017-03-03 11:03:46.550118
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

To raise this limit (150 s in the log above), in ceph.conf:

[osd]
osd_op_thread_suicide_timeout = 900

Operation thread timeout

heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fd306416700' had timed out after 15

To raise the timeout on a running OSD (default value is 15s):

ceph tell osd.XX injectargs --osd-op-thread-timeout 90

Recovery thread timeout

heartbeat_map is_healthy 'OSD::recovery_tp thread 0x7f4c2edab700' had timed out after 30

To raise the timeout on a running OSD (default value is 30s):

ceph tell osd.XX injectargs --osd-recovery-thread-timeout 180
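
Values set with injectargs apply only to the running daemon and are lost when the OSD restarts. To keep them across restarts, the same options can be set in ceph.conf, as with the suicide timeout above. A minimal sketch, assuming the option names are the underscore forms of the injectargs flags used in this note; the values are just the examples from above, not recommendations:

```
[osd]
# Example values taken from this note; tune them for your own cluster.
osd_op_thread_timeout = 90
osd_recovery_thread_timeout = 180
```

Raising these timeouts only hides slow operations from the heartbeat check, so it is still worth finding out why the OSD threads are slow.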

For more details, refer to the Ceph documentation:

http://docs.ceph.com/docs/master/rados/configuration/osd-config-ref/
