Some notes about Ceph
Laurent Barbe @CCM Benchmark

Dealing with some OSD timeouts

In some cases, an operation takes longer than usual for the OSD to process. The operation may then fail, or even cause the OSD to kill itself (the "suicide" timeout). Several parameters control these timeouts. Some examples:

Thread suicide timed out

heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1ee3ca7700' had suicide timed out after 150
common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f1f0c2a3700 time 2017-03-03 11:03:46.550118
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

To raise this timeout, in ceph.conf:

osd_op_thread_suicide_timeout = 900
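
As a rough illustration of where this line could sit, the sketch below writes a sample fragment to a temporary file and reads the value back. Placing the option under an [osd] section is an assumption (the notes above only say ceph.conf), and conf_get is a hypothetical helper, not a Ceph tool. Settings in ceph.conf are read when the daemon starts, so the OSD would normally need a restart to pick up the change.

```shell
#!/bin/sh
# Write a sample fragment to a temp file so nothing touches a real ceph.conf.
conf=$(mktemp)
trap 'rm -f "$conf"' EXIT
cat > "$conf" <<'EOF'
[osd]
# raised from the 150 s reported in the log line above
osd_op_thread_suicide_timeout = 900
EOF

# conf_get FILE KEY: print the value of KEY (hypothetical helper).
# Spaces in option names are normalised to underscores before comparing,
# since ceph.conf files are commonly written either way.
conf_get() {
    awk -F= -v want="$2" '
        /^[ \t]*[#;]/ { next }                  # skip comment lines
        NF >= 2 {
            key = $1
            gsub(/^[ \t]+|[ \t]+$/, "", key)    # trim whitespace around the key
            gsub(/ +/, "_", key)                # "osd op ..." -> "osd_op_..."
            if (key == want) {
                val = $2
                gsub(/^[ \t]+|[ \t]+$/, "", val)
                print val
            }
        }
    ' "$1"
}

conf_get "$conf" osd_op_thread_suicide_timeout
```

The helper ignores section names, so it is only suitable for a quick check of a small file like this one.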

Operation thread timeout

heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fd306416700' had timed out after 15
ceph tell osd.XX injectargs --osd-op-thread-timeout 90
(default value is 15s)
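
When the same value has to be pushed to several OSDs, the command above can be generated per OSD. The sketch below only prints the commands (a dry run) so they can be reviewed first. The function name build_injectargs and the OSD ids 0, 1 and 2 are hypothetical. Values set with injectargs apply to the running daemon only, so a setting that should survive a restart would also need to go into ceph.conf.

```shell
#!/bin/sh
# build_injectargs ID OPTION VALUE
# Print (do not run) the injectargs command for one OSD, in the same
# form as the command shown above.
build_injectargs() {
    printf 'ceph tell osd.%s injectargs --%s %s\n' "$1" "$2" "$3"
}

# Hypothetical OSD ids; replace with the ids of the affected OSDs.
for id in 0 1 2; do
    build_injectargs "$id" osd-op-thread-timeout 90
done
```

Piping the output to `sh` would execute the commands once they look right.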

Recovery thread timeout

heartbeat_map is_healthy 'OSD::recovery_tp thread 0x7f4c2edab700' had timed out after 30
ceph tell osd.XX injectargs --osd-recovery-thread-timeout 180
(default value is 30s)
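
The three cases above can be summarised as a small lookup from log line to option name. This is a minimal sketch covering only the messages quoted in these notes; suggest_timeout_option is a hypothetical helper, and any other heartbeat_map message is reported as unknown.

```shell
#!/bin/sh
# suggest_timeout_option LINE
# Map a heartbeat_map log line to the timeout option discussed above.
# The suicide pattern is tested first so it is not confused with the
# plain "had timed out" message.
suggest_timeout_option() {
    case "$1" in
        *"OSD::osd_op_tp"*"had suicide timed out"*)
            echo "osd_op_thread_suicide_timeout" ;;
        *"OSD::osd_op_tp"*"had timed out"*)
            echo "osd_op_thread_timeout" ;;
        *"OSD::recovery_tp"*"had timed out"*)
            echo "osd_recovery_thread_timeout" ;;
        *)
            echo "unknown" ;;
    esac
}

suggest_timeout_option "heartbeat_map is_healthy 'OSD::recovery_tp thread 0x7f4c2edab700' had timed out after 30"
```

For the sample line, the function prints osd_recovery_thread_timeout.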

For more details, please refer to the Ceph documentation: