
Some notes about Ceph
Laurent Barbe @Adelius / INRAE

Ceph OSD: Where is my data?

The purpose is to verify where my data is stored on the Ceph cluster.

For this, I have just created a minimal cluster with 3 OSDs:

$ ceph-deploy osd create ceph-01:/dev/sdb ceph-02:/dev/sdb ceph-03:/dev/sdb

Where is my OSD directory on ceph-01?

$ mount | grep ceph
/dev/sdb1 on /var/lib/ceph/osd/ceph-0 type xfs (rw,noatime,attr2,delaylog,noquota)

The directory contents:

$ cd /var/lib/ceph/osd/ceph-0; ls -l
total 52
-rw-r--r--   1 root root  487 août  20 12:12 activate.monmap
-rw-r--r--   1 root root    3 août  20 12:12 active
-rw-r--r--   1 root root   37 août  20 12:12 ceph_fsid
drwxr-xr-x 133 root root 8192 août  20 12:18 current
-rw-r--r--   1 root root   37 août  20 12:12 fsid
lrwxrwxrwx   1 root root   58 août  20 12:12 journal -> /dev/disk/by-partuuid/37180b7e-fe5d-4b53-8693-12a8c1f52ec9
-rw-r--r--   1 root root   37 août  20 12:12 journal_uuid
-rw-------   1 root root   56 août  20 12:12 keyring
-rw-r--r--   1 root root   21 août  20 12:12 magic
-rw-r--r--   1 root root    6 août  20 12:12 ready
-rw-r--r--   1 root root    4 août  20 12:12 store_version
-rw-r--r--   1 root root    0 août  20 12:12 sysvinit
-rw-r--r--   1 root root    2 août  20 12:12 whoami

$ du -hs *
4,0K    activate.monmap  the current monmap
4,0K    active           "ok"
4,0K    ceph_fsid        cluster fsid (same as returned by 'ceph fsid')
2,1M    current
4,0K    fsid             UUID of this osd
0       journal          symlink to the journal partition
4,0K    journal_uuid
4,0K    keyring          the key
4,0K    magic            "ceph osd volume v026"
4,0K    ready            "ready"
4,0K    store_version
0       sysvinit
4,0K    whoami           numeric id of the osd
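
To read those small files in one go, something like the sketch below could help. The function name show_osd_meta is mine, the file list is simply the one from the listing above, and the demo runs on a throwaway directory with placeholder values rather than on a real OSD.

```shell
# Print the small identity files of an OSD data directory, one per line.
# Usage on a real node (assumed path): show_osd_meta /var/lib/ceph/osd/ceph-0
show_osd_meta() {
    osd_dir=$1
    for f in whoami fsid ceph_fsid magic ready active store_version; do
        # Skip files that are not present in this directory.
        [ -f "$osd_dir/$f" ] || continue
        printf '%-14s %s\n' "$f" "$(cat "$osd_dir/$f")"
    done
}

# Demo on a temporary directory (placeholder values, not a real OSD).
demo=$(mktemp -d)
printf '0\n' > "$demo/whoami"
printf 'ceph osd volume v026\n' > "$demo/magic"
show_osd_meta "$demo"
rm -rf "$demo"
```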

The data is stored in the "current" directory. It contains a few files and many _head directories:

$ cd current; ls -l | grep -v head
total 20
-rw-r--r-- 1 root root     5 août  20 12:18 commit_op_seq
drwxr-xr-x 2 root root 12288 août  20 12:18 meta
-rw-r--r-- 1 root root     0 août  20 12:12 nosnap
drwxr-xr-x 2 root root   111 août  20 12:12 omap

In the omap directory (a LevelDB store):

$ cd omap; ls -l
-rw-r--r-- 1 root root     150 août  20 12:12 000007.sst
-rw-r--r-- 1 root root 2031616 août  20 12:18 000010.log    
-rw-r--r-- 1 root root      16 août  20 12:12 CURRENT
-rw-r--r-- 1 root root       0 août  20 12:12 LOCK
-rw-r--r-- 1 root root     172 août  20 12:12 LOG
-rw-r--r-- 1 root root     309 août  20 12:12 LOG.old
-rw-r--r-- 1 root root   65536 août  20 12:12 MANIFEST-000009

In the meta directory:

$ cd ../meta; ls -l
total 940
-rw-r--r-- 1 root root  710 août  20 12:14 inc\uosdmap.10__0_F4E9C003__none
-rw-r--r-- 1 root root  958 août  20 12:12 inc\uosdmap.1__0_B65F4306__none
-rw-r--r-- 1 root root  722 août  20 12:14 inc\uosdmap.11__0_F4E9C1D3__none
-rw-r--r-- 1 root root  152 août  20 12:14 inc\uosdmap.12__0_F4E9C163__none
-rw-r--r-- 1 root root  153 août  20 12:12 inc\uosdmap.2__0_B65F40D6__none
-rw-r--r-- 1 root root  574 août  20 12:12 inc\uosdmap.3__0_B65F4066__none
-rw-r--r-- 1 root root  153 août  20 12:12 inc\uosdmap.4__0_B65F4136__none
-rw-r--r-- 1 root root  722 août  20 12:12 inc\uosdmap.5__0_B65F46C6__none
-rw-r--r-- 1 root root  136 août  20 12:14 inc\uosdmap.6__0_B65F4796__none
-rw-r--r-- 1 root root  642 août  20 12:14 inc\uosdmap.7__0_B65F4726__none
-rw-r--r-- 1 root root  153 août  20 12:14 inc\uosdmap.8__0_B65F44F6__none
-rw-r--r-- 1 root root  722 août  20 12:14 inc\uosdmap.9__0_B65F4586__none
-rw-r--r-- 1 root root    0 août  20 12:12 infos__head_16EF7597__none
-rw-r--r-- 1 root root 2870 août  20 12:14 osdmap.10__0_6417091C__none
-rw-r--r-- 1 root root  830 août  20 12:12 osdmap.1__0_FD6E49B1__none
-rw-r--r-- 1 root root 2870 août  20 12:14 osdmap.11__0_64170EAC__none
-rw-r--r-- 1 root root 2870 août  20 12:14 osdmap.12__0_64170E7C__none    current osdmap
-rw-r--r-- 1 root root 1442 août  20 12:12 osdmap.2__0_FD6E4941__none
-rw-r--r-- 1 root root 1510 août  20 12:12 osdmap.3__0_FD6E4E11__none
-rw-r--r-- 1 root root 2122 août  20 12:12 osdmap.4__0_FD6E4FA1__none
-rw-r--r-- 1 root root 2122 août  20 12:12 osdmap.5__0_FD6E4F71__none
-rw-r--r-- 1 root root 2122 août  20 12:14 osdmap.6__0_FD6E4C01__none
-rw-r--r-- 1 root root 2190 août  20 12:14 osdmap.7__0_FD6E4DD1__none
-rw-r--r-- 1 root root 2802 août  20 12:14 osdmap.8__0_FD6E4D61__none
-rw-r--r-- 1 root root 2802 août  20 12:14 osdmap.9__0_FD6E4231__none
-rw-r--r-- 1 root root  354 août  20 12:14 osd\usuperblock__0_23C2FCDE__none
-rw-r--r-- 1 root root    0 août  20 12:12 pglog\u0.0__0_103B076E__none     Log for each pg
-rw-r--r-- 1 root root    0 août  20 12:12 pglog\u0.1__0_103B043E__none
-rw-r--r-- 1 root root    0 août  20 12:12 pglog\u0.11__0_5172C9DB__none
-rw-r--r-- 1 root root    0 août  20 12:12 pglog\u0.13__0_5172CE3B__none
-rw-r--r-- 1 root root    0 août  20 12:13 pglog\u0.15__0_5172CC9B__none
-rw-r--r-- 1 root root    0 août  20 12:13 pglog\u0.16__0_5172CC2B__none
-rw-r--r-- 1 root root    0 août  20 12:12 snapmapper__0_A468EC03__none
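
The file names above seem to follow the pattern <name>__<snap>_<HASH>__<pool>, with "\u" standing for an underscore inside the object name. That is my reading of the listing, not a documented format, so take the sketch below as an assumption. The helper parse_filestore_name is a made-up name; it only does string splitting and never touches the cluster.

```shell
# Split a FileStore file name into: object name, snap, hash, pool.
# Assumed layout: <name>__<snap>_<HASH>__<pool>
parse_filestore_name() {
    fname=$1
    pool=${fname##*__}      # text after the last "__"
    rest=${fname%__*}       # everything before it
    hash=${rest##*_}        # text after the last "_"
    rest=${rest%_*}
    snap=${rest##*__}
    name=${rest%__*}
    # Undo the "\u" escape, assumed to stand for a literal underscore.
    name=$(printf '%s' "$name" | sed 's/\\u/_/g')
    printf '%s %s %s %s\n' "$name" "$snap" "$hash" "$pool"
}

parse_filestore_name 'pglog\u0.1__0_103B043E__none'   # pglog_0.1 0 103B043E none
parse_filestore_name 'osdmap.12__0_64170E7C__none'    # osdmap.12 0 64170E7C none
```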

Let's try decompiling the CRUSH map from the osdmap:

$ ceph osd stat
e12: 3 osds: 3 up, 3 in

$ osdmaptool osdmap.12__0_64170E7C__none --export-crush /tmp/crushmap.bin
osdmaptool: osdmap file 'osdmap.12__0_64170E7C__none'
osdmaptool: exported crush map to /tmp/crushmap.bin

$ crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt

$ cat /tmp/crushmap.txt
# begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 root

# buckets
host ceph-01 {
    id -2       # do not change unnecessarily
    # weight 0.050
    alg straw
    hash 0  # rjenkins1
    item osd.0 weight 0.050
}
host ceph-02 {
    id -3       # do not change unnecessarily
    # weight 0.050
    alg straw
    hash 0  # rjenkins1
    item osd.1 weight 0.050
}
host ceph-03 {
    id -4       # do not change unnecessarily
    # weight 0.050
    alg straw
    hash 0  # rjenkins1
    item osd.2 weight 0.050
}
root default {
    id -1       # do not change unnecessarily
    # weight 0.150
    alg straw
    hash 0  # rjenkins1
    item ceph-01 weight 0.050
    item ceph-02 weight 0.050
    item ceph-03 weight 0.050
}


# end crush map

OK, that's what I expected. :)
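
For a quicker check than reading the whole file, the decompiled map can be reduced to "bucket item weight" lines. This is only a sketch: crush_items is a name I picked, and the awk patterns assume the text layout printed by crushtool -d above (a "<type> <name> {" line opening each bucket, then "item <name> weight <w>" lines).

```shell
# List "bucket item weight" triples from a decompiled CRUSH map.
# Usage: crush_items /tmp/crushmap.txt
crush_items() {
    awk '
        # A bucket starts with "<type> <name> {", e.g. "host ceph-01 {".
        $3 == "{"                      { bucket = $2 }
        # Inside a bucket: "item <name> weight <w>".
        $1 == "item" && $3 == "weight" { print bucket, $2, $4 }
    ' "$1"
}

# Demo on a short excerpt written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
host ceph-01 {
    id -2       # do not change unnecessarily
    alg straw
    item osd.0 weight 0.050
}
root default {
    item ceph-01 weight 0.050
}
EOF
crush_items "$sample"
rm -f "$sample"
```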

The cluster is empty:

$ find *_head -type f | wc -l
0

The list of directories corresponds to the output of 'ceph pg dump':

$ for dir in $(ceph pg dump | grep '\[0,' | cut -f1); do if [ -d "${dir}_head" ]; then echo exist; else echo nok; fi; done | sort | uniq -c
dumped all in format plain
     69 exist

To get all stats for a specific PG:

$ ceph pg 0.1 query
{ "state": "active+clean",
  "epoch": 12,
  "up": [
  "acting": [
  "info": { "pgid": "0.1",
      "last_update": "0'0",
      "last_complete": "0'0",
      "log_tail": "0'0",
      "last_backfill": "MAX",
      "purged_snaps": "[]",
      "history": { "epoch_created": 1,
          "last_epoch_started": 12,
          "last_epoch_clean": 12,
          "last_epoch_split": 0,
          "same_up_since": 9,
          "same_interval_since": 9,
          "same_primary_since": 5,
          "last_scrub": "0'0",
          "last_scrub_stamp": "2013-08-20 12:12:37.851559",
          "last_deep_scrub": "0'0",
          "last_deep_scrub_stamp": "2013-08-20 12:12:37.851559",
          "last_clean_scrub_stamp": "0.000000"},
      "stats": { "version": "0'0",
          "reported_seq": "12",
          "reported_epoch": "12",
          "state": "active+clean",
          "last_fresh": "2013-08-20 12:16:22.709534",
          "last_change": "2013-08-20 12:16:22.105099",
          "last_active": "2013-08-20 12:16:22.709534",
          "last_clean": "2013-08-20 12:16:22.709534",
          "last_became_active": "0.000000",
          "last_unstale": "2013-08-20 12:16:22.709534",
          "mapping_epoch": 5,
          "log_start": "0'0",
          "ondisk_log_start": "0'0",
          "created": 1,
          "last_epoch_clean": 12,
          "parent": "0.0",
          "parent_split_bits": 0,
          "last_scrub": "0'0",
          "last_scrub_stamp": "2013-08-20 12:12:37.851559",
          "last_deep_scrub": "0'0",
          "last_deep_scrub_stamp": "2013-08-20 12:12:37.851559",
          "last_clean_scrub_stamp": "0.000000",
          "log_size": 0,
          "ondisk_log_size": 0,
          "stats_invalid": "0",
          "stat_sum": { "num_bytes": 0,
              "num_objects": 0,
              "num_object_clones": 0,
              "num_object_copies": 0,
              "num_objects_missing_on_primary": 0,
              "num_objects_degraded": 0,
              "num_objects_unfound": 0,
              "num_read": 0,
              "num_read_kb": 0,
              "num_write": 0,
              "num_write_kb": 0,
              "num_scrub_errors": 0,
              "num_shallow_scrub_errors": 0,
              "num_deep_scrub_errors": 0,
              "num_objects_recovered": 0,
              "num_bytes_recovered": 0,
              "num_keys_recovered": 0},
          "stat_cat_sum": {},
          "up": [
          "acting": [
      "empty": 1,
      "dne": 0,
      "incomplete": 0,
      "last_epoch_started": 12},
  "recovery_state": [
        { "name": "Started\/Primary\/Active",
          "enter_time": "2013-08-20 12:15:30.102250",
          "might_have_unfound": [],
          "recovery_progress": { "backfill_target": -1,
              "waiting_on_backfill": 0,
              "backfill_pos": "0\/\/0\/\/-1",
              "backfill_info": { "begin": "0\/\/0\/\/-1",
                  "end": "0\/\/0\/\/-1",
                  "objects": []},
              "peer_backfill_info": { "begin": "0\/\/0\/\/-1",
                  "end": "0\/\/0\/\/-1",
                  "objects": []},
              "backfills_in_flight": [],
              "pull_from_peer": [],
              "pushing": []},
          "scrub": { "scrubber.epoch_start": "0",
              "scrubber.active": 0,
              "scrubber.block_writes": 0,
              "scrubber.finalizing": 0,
              "scrubber.waiting_on": 0,
              "scrubber.waiting_on_whom": []}},
        { "name": "Started",
          "enter_time": "2013-08-20 12:14:51.501628"}]}

Retrieve an object on the cluster

In this test we create a pool with the default settings (pg_num=8 and 2 replicas):

$ rados mkpool testpool
$ wget -q http://ceph.com/docs/master/_static/logo.png
$ md5sum logo.png
4c7c15e856737efc0d2d71abde3c6b28  logo.png

$ rados put -p testpool logo.png logo.png
$ ceph osd map testpool logo.png
osdmap e14 pool 'testpool' (3) object 'logo.png' -> pg 3.9e17671a (3.2) -> up [2,1] acting [2,1]

My Ceph logo is in pg 3.2 (primary on osd.2 and replica on osd.1).
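
The PG number can be checked by hand from the hash printed by 'ceph osd map'. With pg_num=8, a power of two, the PG index is just the low bits of the hash. The sketch below assumes a power-of-two pg_num (other values need Ceph's stable modulo, which I do not reproduce here), and pg_for_hash is a name of my own.

```shell
# Map an object hash (hex, without 0x) to its placement group index.
# Assumes pg_num is a power of two, so the modulo is a plain bit mask.
pg_for_hash() {
    echo $(( 0x$1 & ($2 - 1) ))
}

pg_for_hash 9e17671a 8    # prints 2, hence pg 3.2 in pool 3
```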

$ ceph osd tree
# id    weight  type name   up/down reweight
-1  0.15    root default
-2  0.04999     host ceph-01
0   0.04999         osd.0   up  1   
-3  0.04999     host ceph-02
1   0.04999         osd.1   up  1   
-4  0.04999     host ceph-03
2   0.04999         osd.2   up  1

And osd.2 is on ceph-03:

$ cd /var/lib/ceph/osd/ceph-2/current/3.2_head/
$ ls
$ md5sum logo.png__head_9E17671A__3
4c7c15e856737efc0d2d71abde3c6b28  logo.png__head_9E17671A__3

It is exactly the same :)
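
The directory can also be derived from the 'ceph osd map' line itself. The sketch below is plain text processing on that output; osd_map_to_dir is a hypothetical helper, and the path assumes the default /var/lib/ceph/osd/ceph-<id> layout used on this cluster. You still have to log in to the host that carries the primary OSD.

```shell
# Turn one line of "ceph osd map" output into the directory that should
# hold the object on the primary OSD.
osd_map_to_dir() {
    line=$1
    # PG id is the value in parentheses after "-> pg <hash>", e.g. "(3.2)".
    pgid=$(printf '%s\n' "$line" | sed -n 's/.*-> pg [^ ]* (\([0-9a-f.]*\)).*/\1/p')
    # Primary OSD is the first entry of the acting set, e.g. "[2,1]".
    primary=$(printf '%s\n' "$line" | sed -n 's/.* acting \[\([0-9]*\).*/\1/p')
    printf '/var/lib/ceph/osd/ceph-%s/current/%s_head/\n' "$primary" "$pgid"
}

osd_map_to_dir "osdmap e14 pool 'testpool' (3) object 'logo.png' -> pg 3.9e17671a (3.2) -> up [2,1] acting [2,1]"
# prints /var/lib/ceph/osd/ceph-2/current/3.2_head/
```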

Import RBD

Same thing, but testing as a block device.

$ rbd import logo.png testpool/logo.png 
Importing image: 100% complete...done.
$ rbd info testpool/logo.png
rbd image 'logo.png':
    size 3898 bytes in 1 objects
    order 22 (4096 KB objects)
    block_name_prefix: rb.0.1048.2ae8944a
    format: 1

Only one object.

$ rados ls -p testpool
$ ceph osd map testpool logo.png.rbd
osdmap e14 pool 'testpool' (3) object 'logo.png.rbd' -> pg 3.d592352c (3.4) -> up [0,2] acting [0,2]

Let's go.

$ cd /var/lib/ceph/osd/ceph-0/current/3.4_head/
$ cat logo.png.rbd__head_D592352C__3
<<< Rados Block Device Image >>>

Using the block name prefix of the RBD image, 'rb.0.1048.2ae8944a', we can locate its first data object:

$ ceph osd map testpool rb.0.1048.2ae8944a.000000000000
osdmap e14 pool 'testpool' (3) object 'rb.0.1048.2ae8944a.000000000000' -> pg 3.d512078b (3.3) -> up [2,1] acting [2,1]

On ceph-03 :

$ cd /var/lib/ceph/osd/ceph-2/current/3.3_head
$ md5sum rb.0.1048.2ae8944a.000000000000__head_D512078B__3
4c7c15e856737efc0d2d71abde3c6b28  rb.0.1048.2ae8944a.000000000000__head_D512078B__3

We get the file back unchanged because it is smaller than one 4 MB object, so it is not split :)
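
The arithmetic behind "1 objects" and the object name can be sketched as follows. Both helper names are mine. The count is an upper bound, since I assume objects that were never written may not exist at all, and the naming rule (prefix, a dot, then the index as 12 hex digits) is inferred from the format 1 name seen above.

```shell
# Maximum number of RADOS objects for an image: ceil(size / 2^order).
rbd_object_count() {
    obj_size=$(( 1 << $2 ))
    echo $(( ($1 + obj_size - 1) / obj_size ))
}

# Name of the n-th data object of a format 1 image (assumed rule):
# block_name_prefix, a dot, then the index as 12 hex digits.
rbd_object_name() {
    printf '%s.%012x\n' "$1" "$2"
}

rbd_object_count 3898 22                 # prints 1
rbd_object_name rb.0.1048.2ae8944a 0     # prints rb.0.1048.2ae8944a.000000000000
```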

Try RBD snapshot

$ rbd snap create testpool/logo.png@snap1
$ rbd snap ls testpool/logo.png
     2 snap1 3898 bytes
$ echo "testpool/logo.png" >> /etc/ceph/rbdmap
$ service rbdmap reload
[ ok ] Starting RBD Mapping: testpool/logo.png.
[ ok ] Mounting all filesystems...done.

$ dd if=/dev/zero of=/dev/rbd/testpool/logo.png 
dd: writing to '/dev/rbd/testpool/logo.png': No space left on device
8+0 records in
7+0 records out
3584 bytes (3.6 kB) copied, 0.285823 s, 12.5 kB/s

$ ceph osd map testpool rb.0.1048.2ae8944a.000000000000
osdmap e15 pool 'testpool' (3) object 'rb.0.1048.2ae8944a.000000000000' -> pg 3.d512078b (3.3) -> up [2,1] acting [2,1]

It's in the same place on ceph-03:

$ cd /var/lib/ceph/osd/ceph-2/current/3.3_head
$ md5sum *
4c7c15e856737efc0d2d71abde3c6b28  rb.0.1048.2ae8944a.000000000000__2_D512078B__3
dd99129a16764a6727d3314b501e9c23  rb.0.1048.2ae8944a.000000000000__head_D512078B__3

Notice that the file with 2 in its name (snap id 2) contains the original data, and that a new file, head, has been created for the current data.
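
To tell the copies apart without reading the names by eye, the snap field can be pulled out of each file name. This is a rough sketch that assumes the <name>__<snap>_<HASH>__<pool> layout seen in this post; list_snap_copies is a made-up name, and the demo uses empty files in a temporary directory.

```shell
# Label each file of a PG directory as current data ("head") or as the
# preserved copy for a snapshot id.
list_snap_copies() {
    for path in "$1"/*; do
        [ -f "$path" ] || continue
        fname=${path##*/}
        rest=${fname%__*}     # drop "__<pool>"
        rest=${rest%_*}       # drop "_<HASH>"
        snap=${rest##*__}     # keep "<snap>"
        if [ "$snap" = "head" ]; then
            printf '%s  current data\n' "$fname"
        else
            printf '%s  snapshot id %s\n' "$fname" "$snap"
        fi
    done
}

# Demo with empty placeholder files named like the ones above.
demo=$(mktemp -d)
: > "$demo/rb.0.1048.2ae8944a.000000000000__2_D512078B__3"
: > "$demo/rb.0.1048.2ae8944a.000000000000__head_D512078B__3"
list_snap_copies "$demo"
rm -rf "$demo"
```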

For the next tests, I will try striped files, RBD format 2 and pool snapshots.

