[clue] btrfs

Dennis J Perkins dennisjperkins at comcast.net
Sat Mar 9 14:39:01 MST 2019


Sean, I've learned more about btrfs.  Btrfs is meant to be a modern
file system that can handle large amounts of data in the exabyte range,
expandable using its own volume management, and with an emphasis on
reliability.  Reliability comes from keeping a checksum for each data
and metadata block; if an error is detected, hopefully the mirrored
block is good.  Mirroring is done by btrfs's own RAID.  It doesn't use
mdadm for RAID because it needs to be able to read the mirrored copy
when a data block is bad, and mdadm RAID doesn't give it that access.

Btrfs's logical volume management is different from LVM because you
can't set volume size.  You simply add a drive and btrfs incorporates
it into the file system.  This feels like the JBOD that Greyhole
offers.

By default btrfs creates two copies of the metadata and one copy of the
data if there is only one drive, but you can disable the metadata copy.

If you only have a single SSD, btrfs keeps only one copy of both the
data and the metadata.

If you have two drives, the default is to mirror the metadata and
stripe the data.  You can change this independently for metadata and
data by using the -m and -d options of mkfs.btrfs and specifying raid0
(stripe) or raid1 (mirror).
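
To make the -m and -d choice concrete, here is a small Python sketch
that only builds the command line and does not run it.  The helper
name and the /dev/sdX and /dev/sdY device names are placeholders I
made up, and the profile list is limited to the three mentioned in
this email.

```python
import shlex

# Only the profiles mentioned in this email; mkfs.btrfs knows others too.
PROFILES = {"single", "raid0", "raid1"}


def mkfs_btrfs_command(devices, metadata="raid1", data="raid0"):
    """Build (but do not run) an mkfs.btrfs command line.

    The defaults mirror the two-drive behaviour described above:
    mirrored metadata, striped data.
    """
    if metadata not in PROFILES or data not in PROFILES:
        raise ValueError("unknown profile")
    if not devices:
        raise ValueError("need at least one device")
    return ["mkfs.btrfs", "-m", metadata, "-d", data, *devices]


# Hypothetical device names; mirror both metadata and data.
cmd = mkfs_btrfs_command(["/dev/sdX", "/dev/sdY"], metadata="raid1", data="raid1")
print(shlex.join(cmd))
```

Printing the command instead of running it seemed safer, since
mkfs.btrfs destroys whatever is on the devices.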

If you have more than two drives, you need to understand that mirroring
puts the data or metadata on two drives only.  Striping goes across
several drives, but I don't know what the maximum number is.  If
mirroring is selected, btrfs does its best to spread the data onto all
of the drives, but each data block is only on two drives.  See the
diagram in the previous email to see how three drives can be used.  The
drives don't need to be the same size, but part of one or more drives
might go unused.
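
Here is a rough way to estimate how much space mirroring gives you
with mixed drive sizes.  This is my own simplified model, not
something taken from the btrfs documentation: it assumes every block
lands on exactly two different drives and ignores metadata overhead
and allocation granularity.

```python
def raid1_usable(sizes):
    """Estimate usable space when every block is stored on exactly two drives.

    sizes: drive capacities, all in the same unit (for example TB).
    """
    if len(sizes) < 2:
        raise ValueError("mirroring needs at least two drives")
    total = sum(sizes)
    largest = max(sizes)
    # Half the raw space is the upper bound.  The largest drive can only be
    # filled as far as the other drives can hold the second copies.
    return min(total / 2, total - largest)


def raid1_unused(sizes):
    """Raw space left over that cannot be paired with a second copy."""
    return sum(sizes) - 2 * raid1_usable(sizes)


print(raid1_usable([2, 2, 2]))   # three 2 TB drives
print(raid1_usable([4, 1, 1]))   # one large drive, two small ones
print(raid1_unused([4, 1, 1]))
```

With three 2 TB drives the estimate is 3 TB, which matches the example
in the previous email.  With one 4 TB and two 1 TB drives it drops to
2 TB, and 2 TB of the large drive goes unused.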

If you just want as much data space as possible, and you don't care
about striping or mirroring, you can set -m and -d to single, and the
drives will look like one large drive.


You can scrub the filesystem manually or periodically to fix errors.
It checks every data and metadata block for checksum errors, and when
it finds a bad block it replaces it with the mirrored copy, unless that
copy is also bad.  I assume it also checks each mirrored block for
errors, but I didn't find confirmation.
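
A toy model of what I think a scrub does on a mirrored pair, written
in Python.  It is not how btrfs is implemented: the lists stand in for
two drives, zlib.crc32 stands in for whatever checksum btrfs really
uses, and checking both copies is my assumption from above.

```python
import zlib


def checksum(block: bytes) -> int:
    # Stand-in checksum for illustration only.
    return zlib.crc32(block)


def scrub(copy_a, copy_b, sums):
    """Verify both mirrored copies and repair a bad block from the good one.

    Returns the indexes of blocks where both copies failed the checksum.
    """
    unrecoverable = []
    for i, expected in enumerate(sums):
        a_ok = checksum(copy_a[i]) == expected
        b_ok = checksum(copy_b[i]) == expected
        if a_ok and not b_ok:
            copy_b[i] = copy_a[i]      # repair B from A
        elif b_ok and not a_ok:
            copy_a[i] = copy_b[i]      # repair A from B
        elif not a_ok and not b_ok:
            unrecoverable.append(i)    # nothing good to copy from
    return unrecoverable


blocks = [b"alpha", b"beta", b"gamma"]
sums = [checksum(b) for b in blocks]
copy_a = list(blocks)
copy_b = list(blocks)

copy_a[1] = b"bXta"    # one copy of block 1 is corrupted
copy_a[2] = b"gXmma"   # both copies of block 2 are corrupted
copy_b[2] = b"gaYma"

bad = scrub(copy_a, copy_b, sums)
print(bad, copy_a[1])
```

Block 1 is repaired from the good copy, and block 2 is reported as
unrecoverable because neither copy matches its checksum.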


There are some optimizations when using SSDs, like avoiding unnecessary
seek optimizations, and writing in clusters, even if the writes are for
separate files.  This results in greater throughput but more seeks
later on.


You can format an entire drive without setting up a partition first.
Some people advise against this but I haven't seen any reason given
other than btrfs might not be properly aligned.  I don't know if this
is true because it probably knows how to calculate alignment.  If you
are going to use a boot or a swap partition, you will need to partition
the drive.

At the moment, swap files are not recommended, but this might change.

It's also not recommended to put a database or virtual machine images
on btrfs unless the file or its directory is set up to not do Copy on
Write.
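
I didn't say above how to turn off Copy on Write.  The way I have seen
described is chattr +C on the directory, which then applies to files
created in it afterwards (existing files with data are not converted).
Treat that as something to check against the chattr man page.  The
Python sketch below only wraps that one command; the function name and
the directory path are made up for the example.

```python
import subprocess


def disable_cow(directory, dry_run=True):
    """Build, and optionally run, the chattr command that marks a directory No_COW.

    With dry_run=True nothing is executed; the command list is just returned.
    """
    cmd = ["chattr", "+C", directory]
    if not dry_run:
        # Raises CalledProcessError if chattr fails, for example on a
        # file system that does not support the attribute.
        subprocess.run(cmd, check=True)
    return cmd


# Hypothetical directory for VM images; dry run only.
print(disable_cow("/srv/vm-images"))
```

It is best to do this on an empty directory before copying the
database or VM image in.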



On Sun, 2019-02-24 at 21:59 -0700, Dennis J Perkins wrote:
> Sean, here is what btrfs offers, assuming I haven't misunderstood
> something.  You can compare this against ZFS, since you are using it.
> 
> Btrfs tries to ensure data integrity.  I know that some bugs suggest
> the opposite.  Checksums for files or datablocks (I'm not sure which)
> are stored with the metadata.  If an error is detected and RAID 1 is
> in use, the other copy of the data is checked and if it is good, it
> can be copied to the first drive to replace the bad file or block.  I
> don't know if it overwrites the original or writes to a different
> location.  I don't know how it handles an error with RAID 10.
> 
> File compression is available, but as far as I can tell it is not
> enabled by default.  It can be turned on, and you can select the
> compression algorithm.  I think it can be configured to not compress
> a file if it is already compressed.
> 
> It uses copy on write (CoW) instead of journalling.  This allows
> snapshots if you have set up subvolumes.  It is also supposed to
> improve the life of an SSD because there is no journal to write to.
> 
> If you set up subvolumes, you can create snapshots very quickly
> because you are not copying files.  The link structure is copied
> instead.  If you then modify a file, CoW is used on the file in the
> subvolume.  The original block can't be deleted because the snapshot
> is pointing to it.  You can use the snapshot to back up the subvolume
> or to restore it.
> 
> CoW causes fragmentation, but autodefragmentation is available.  I
> don't know if trim would be a better choice when using SSDs.
> 
> Btrfs has its own volume manager and RAID.  They don't work quite the
> same as LVM or mdadm.  Volume management lets you create a pool.  Add
> a drive and the pool gets larger.  You can't specify the size of a
> volume but I don't know why you would need to.
> 
> Subvolumes are kind of like partitions, but they can grow or shrink.
> 
> RAID 0, 1, and 10 are supported.  RAID 5 and 6 should not be used 
> because they are still working on a solution to the write hole
> problem.
> 
> I don't know about regular RAID, but RAID 1 in btrfs has two copies
> of each file.  If the drives in the pool are different sizes, btrfs
> handles making sure that all data is on two drives, but not
> necessarily the same drives.  For example, if you have three 2 TB
> drives, you have 3 TB of useful space with data spread on all three
> drives.
> 
> 
> +-------+              +----------+                +---------+
> |       |              |          |                |         |
> | 1TB   |------------->|    1TB   |     +--------->|   1TB   |
> |       |              |          |     |          |         |
> +-------+              +----------+     |          +---------+
> | 1TB   |----+         |    1TB   |-----+          |         |
> |       |    |         |          |         +----->|  1TB    |
> +-------+    |         +----------+         |      |         |
>              +------------------------------+      +---------+
> 
> 
> If the drives are different sizes, sometimes not all of a drive in
> the pool will be used.


