A quick glance

We can start at ext2fs/ext2fs.h and see what's in there.

At first sight, we are presented with the following line:

#include <sys/endian.h>

Endianness is byte order: the order in which the bytes of a multi-byte value are stored. On big-endian systems, 0xABCDEF would be stored as the bytes [AB, CD, EF]. On little-endian systems, the bytes would be [EF, CD, AB]. The choice is a property of the CPU architecture and makes little difference to the end user. Of course, as low-level developers, this is something we will have to keep in mind, because an on-disk field has one fixed byte order no matter which machine reads it.
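To make the byte-order point concrete, here is a minimal sketch of decoding a 32-bit value from raw bytes in a way that does not depend on the host CPU. The helper names load_le32 and load_be32 are made up for illustration and are not taken from the OpenBSD source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode a 32-bit little-endian value: lowest byte comes first. */
static uint32_t
load_le32(const uint8_t *p)
{
	assert(p != NULL);
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	    ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decode a 32-bit big-endian value: highest byte comes first. */
static uint32_t
load_be32(const uint8_t *p)
{
	assert(p != NULL);
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	    ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}
```

Given the bytes [EF, CD, AB, 00], load_le32 yields 0x00ABCDEF while load_be32 yields 0xEFCDAB00, on any host.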

Now if we quickly glance at kernel.org, we can see the following:

All fields in ext4 are written to disk in little-endian order. HOWEVER, all fields in jbd2 (the journal) are written to disk in big-endian order.^1^

Journaling was already implemented in ext3, and the journal fields in the superblock are present in the ext2 source only as incomplete features. This is irrelevant to us for now.

#define BBSIZE		1024
#define SBSIZE		1024
#define	BBOFF		((off_t)(0))
#define	SBOFF		((off_t)(BBOFF + BBSIZE))

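These are some standard macros to be used later in the code. What matters is that the boot block and the superblock are each a fixed 1024 bytes, with the boot block at offset 0 and the superblock immediately after it at offset 1024.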

The superblock is 1 KB of data starting at byte offset 1024, right after the boot block. It holds filesystem-wide metadata: the number of inodes and blocks, how many of them are free, the block size, the layout of block groups, the filesystem state, and feature flags. With this in one place, the OS can interpret the rest of the filesystem without having to scan the disk.

The boot block is the first 1 KB of the filesystem. ext4 leaves this space untouched so that boot code can live there, which means no more than 1 KB of instructions fits at this location. On BIOS systems with an MBR, the firmware loads the first 512-byte sector of the disk into memory and runs it. On UEFI systems, the firmware instead loads a bootloader from a FAT-formatted EFI System Partition. Those are "first-stage" bootloaders. The 1 KB boot block can hold a later stage, which in turn can load more bootloaders as necessary or get right into the kernel.

The following lines:

#define	BBLOCK		((daddr_t)(0))
#define SBLOCK 		((daddr_t)(BBLOCK + BBSIZE / DEV_BSIZE))

define the block addresses of the boot block and superblock on disk, counted in units of DEV_BSIZE bytes. Obviously, the boot block is at address 0, but the superblock's address depends on how many bytes fit in each device block: it is 1024 divided by that amount. DEV_BSIZE is not defined in this header; it comes from the system headers. If it is the common 512-byte sector size, SBLOCK works out to 2. Hopefully we can confirm the exact value going forward 🙏
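To see how SBLOCK depends on the device block size, here is a small sketch of the same arithmetic as a function. The helper sblock_for is hypothetical, and 512 for DEV_BSIZE is an assumption, since the header itself does not define it.

```c
#include <assert.h>
#include <stdint.h>

#define BBSIZE	1024	/* boot block size in bytes, as in ext2fs.h */
#define BBLOCK	0	/* boot block address, as in ext2fs.h */

/*
 * Hypothetical helper: superblock address in device blocks.
 * dev_bsize stands in for DEV_BSIZE (assumed, not taken from the header).
 */
static int64_t
sblock_for(int64_t dev_bsize)
{
	/* The macro only makes sense if BBSIZE is a whole number of blocks. */
	assert(dev_bsize > 0 && BBSIZE % dev_bsize == 0);
	return BBLOCK + BBSIZE / dev_bsize;
}
```

With 512-byte device blocks this gives block 2, and 2 * 512 = 1024 bytes, which agrees with SBOFF.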

"Inodes are, like in UFS, 32-bit unsigned integers and therefore ufsino_t. Disk blocks are 32-bit, if the filesystem isn't operating in 64-bit mode (the incompatible ext4 64BIT flag). More work is needed to properly use daddr_t as the disk block data type on both BE and LE architectures. XXX disk blocks are simply u_int32_t for now."

So say the OpenBSD developers. The only point worth noting from this is that we have to implement 64-bit mode moving forward.

#define LOG_MINBSIZE	10
#define MINBSIZE	(1 << LOG_MINBSIZE)
#define LOG_MINFSIZE	10
#define MINFSIZE	(1 << LOG_MINFSIZE)

Both the block size and the fragment size are at least 1 << 10 = 1024 bytes. ext4 was likely designed with the hope that block sizes would increase as disk capacity grows, so that it would remain extensible for years to come.

#define MAXMNTLEN	512

The maximum length of a mount point path is 512 bytes. You can test this by passing a very long mount point to mount -t ext2fs <filesystem>. Even on other systems, it's not likely to work beyond 512 bytes.

#define MINFREE		5

As the comment explains pretty well, 5% of blocks should be free. Read the comment from lines 89-100 for the full details.

struct ext2fs {
	u_int32_t  e2fs_icount;		/* Inode count */
	u_int32_t  e2fs_bcount;		/* blocks count */
	u_int32_t  e2fs_rbcount;	/* reserved blocks count */
	u_int32_t  e2fs_fbcount;	/* free blocks count */
	u_int32_t  e2fs_ficount;	/* free inodes count */
	u_int32_t  e2fs_first_dblock;	/* first data block */
	u_int32_t  e2fs_log_bsize;	/* block size = 1024*(2^e2fs_log_bsize) */
	u_int32_t  e2fs_log_fsize;	/* fragment size log2 */

This is the beginning of the actual superblock. We can break down the information line by line. Inodes are data structures that hold the necessary information about a file, except the file name and the actual data itself. On some filesystems, the inode stores a few pointers directly to the file's first data blocks, then a pointer to a block full of pointers to data blocks, then a pointer to a block of pointers to such pointer blocks, and so on. This allows for a much larger maximum file size. Momentarily, we will see if the case is such with ext4 as well.
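As a rough illustration of why indirection raises the maximum file size, the sketch below counts how many data blocks such a scheme can address. It assumes 12 direct pointers, one each of single, double, and triple indirect pointers, and 4-byte block pointers, as in the classic ext2 layout; whether ext4 works this way is still to be seen. The function name is made up.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 12 direct pointers per inode (classic ext2 layout). */
#define NDIRECT	12

/*
 * Data blocks reachable through direct, single-, double- and
 * triple-indirect pointers, given the block size in bytes.
 */
static uint64_t
max_file_blocks(uint32_t block_size)
{
	assert(block_size >= 4);
	uint64_t ppb = block_size / 4;	/* 4-byte pointers per block */
	return NDIRECT + ppb + ppb * ppb + ppb * ppb * ppb;
}
```

With 1024-byte blocks, each indirect block holds 256 pointers, so almost all of the reachable blocks come from the triple-indirect level (256 * 256 * 256).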

The reserved blocks count is the number of blocks set aside so that a nearly full filesystem still leaves some room for privileged processes; ordinary users cannot allocate them. This is the same idea as MINFREE above, and it is unrelated to journaling.

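The block and fragment sizes are stored as base-2 logarithms relative to 1024, so the e2fs_log_bsize and e2fs_log_fsize fields hold just that: a value of 0 means 1024 bytes, 1 means 2048, and so on.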
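The encoding can be sketched as follows, using the formula from the struct comment (block size = 1024 * 2^e2fs_log_bsize). The function name and the upper bound of 21 are assumptions made for this example, chosen only so the shift cannot overflow 32 bits.

```c
#include <assert.h>
#include <stdint.h>

#define LOG_MINBSIZE	10	/* as in ext2fs.h: minimum block size is 1 << 10 */

/* Turn the on-disk e2fs_log_bsize value into a block size in bytes. */
static uint32_t
block_size_from_log(uint32_t log_bsize)
{
	/*
	 * Keep the shift below 32 bits. A real driver would reject the
	 * filesystem here instead of asserting.
	 */
	assert(log_bsize <= 21);
	return (uint32_t)1 << (LOG_MINBSIZE + log_bsize);
}
```

A stored value of 2 therefore describes the common 4096-byte block size.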

	u_int32_t  e2fs_bpg;		/* blocks per group */
	u_int32_t  e2fs_fpg;		/* frags per group */
	u_int32_t  e2fs_ipg;		/* inodes per group */
	u_int32_t  e2fs_mtime;		/* mount time */
	u_int32_t  e2fs_wtime;		/* write time */
	u_int16_t  e2fs_mnt_count;	/* mount count */
	u_int16_t  e2fs_max_mnt_count;	/* max mount count */
	u_int16_t  e2fs_magic;		/* magic number */
	u_int16_t  e2fs_state;		/* file system state */
	u_int16_t  e2fs_beh;		/* behavior on errors */

These lines are mostly self-explanatory. The one thing worth pointing out is that a group (block group) is a contiguous run of blocks that the filesystem manages as a unit, so that a file's inode and data can be kept close together on disk. The mount and write times are stored as Unix timestamps. The mount count and maximum mount count let a consistency check be forced after a certain number of mounts. Of the last three, the magic number (0xEF53) identifies the filesystem as ext2/3/4, the state records whether it was cleanly unmounted or has errors, and the behavior field says what to do when an error is detected.
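Checking the magic number is usually the first thing done with a freshly read superblock. The sketch below assumes the ext2 magic value 0xEF53 and an offset of 56 bytes, which is what you get by adding up the sizes of the fields listed so far (thirteen 32-bit fields and two 16-bit fields). The names E2FS_MAGIC_OFFSET and looks_like_ext2 are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define E2FS_MAGIC		0xef53
#define E2FS_MAGIC_OFFSET	56	/* 13 * 4 + 2 * 2 bytes into the superblock */

/* Return 1 if the buffer carries the ext2 magic number, 0 otherwise. */
static int
looks_like_ext2(const uint8_t *sb, size_t len)
{
	assert(sb != NULL);
	if (len < E2FS_MAGIC_OFFSET + 2)
		return 0;
	/* On-disk fields are little-endian: low byte first. */
	uint16_t magic = (uint16_t)(sb[E2FS_MAGIC_OFFSET] |
	    (sb[E2FS_MAGIC_OFFSET + 1] << 8));
	return magic == E2FS_MAGIC;
}
```

The buffer passed in is the 1024-byte superblock itself, that is, the data read from byte offset 1024 of the filesystem.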

	u_int16_t  e2fs_minrev;		/* minor revision level */

This is the minor revision level of the filesystem format. Despite the abbreviation, minrev means "minor", not "minimum": it does not by itself say which features are needed to read and mount the filesystem. That job belongs to the feature flags further down.

	u_int32_t  e2fs_lastfsck;	/* time of last fsck */
	u_int32_t  e2fs_fsckintv;	/* max time between fscks */
	u_int32_t  e2fs_creator;	/* creator OS */
	u_int32_t  e2fs_rev;		/* revision level */

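These fields mostly exist for fsck(8): when it last ran and how long the filesystem may go before the next check. The other two record which OS created the filesystem and its major revision level.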

"filesystem consistency check and interactive repair", as my manual page likes to call fsck. It replays the journal, or does whatever else the filesystem requires to be cleaned up.

	u_int16_t  e2fs_ruid;		/* default uid for reserved blocks */
	u_int16_t  e2fs_rgid;		/* default gid for reserved blocks */

These are the default user and group IDs that processes need to have to use the reserved blocks. As there isn't much information available online, let's just keep going for now.

	/* EXT2_DYNAMIC_REV superblocks */
	u_int32_t  e2fs_first_ino;	/* first non-reserved inode */
	u_int16_t  e2fs_inode_size;	/* size of inode structure */
	u_int16_t  e2fs_block_group_nr;	/* block grp number of this sblk*/
	u_int32_t  e2fs_features_compat; /*  compatible feature set */
	u_int32_t  e2fs_features_incompat; /* incompatible feature set */
	u_int32_t  e2fs_features_rocompat; /* RO-compatible feature set */
	u_int8_t   e2fs_uuid[16];	/* 128-bit uuid for volume */
	char       e2fs_vname[16];	/* volume name */
	char       e2fs_fsmnt[64];	/* name mounted on */
	u_int32_t  e2fs_algo;		/* For compression */
	u_int8_t   e2fs_prealloc;	/* # of blocks to preallocate */
	u_int8_t   e2fs_dir_prealloc;	/* # of blocks to preallocate for dir */
	u_int16_t  e2fs_reserved_ngdb;	/* # of reserved gd blocks for resize */

These are the additional superblock fields that exist when the revision level is EXT2_DYNAMIC_REV (revision 1) or higher, as the comment says. They were probably added after the original ext2 layout was fixed, which likely means the earliest ext2 didn't have these features. The feature sets are bitmasks of flags to be defined later. The UUID, volume name, and last mount point are informational fields, and the algorithm field relates to compression. e2fs_prealloc must be for each file. gd likely stands for group descriptor. With that in mind, let's keep moving down the superblock.
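The three feature sets differ in what an implementation does when it sees a flag it does not know. By the usual ext2/3/4 convention, unknown compatible flags can be ignored, unknown incompatible flags mean the filesystem must not be mounted, and unknown RO-compatible flags allow a read-only mount only. A minimal sketch of that decision follows; the struct, enum, and function names, as well as the "supported" masks, are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum mount_mode { MOUNT_REFUSE, MOUNT_READONLY, MOUNT_READWRITE };

/* Hypothetical container for the three superblock feature fields. */
struct features {
	uint32_t compat;
	uint32_t incompat;
	uint32_t rocompat;
};

/* Decide how to mount, given the flags this implementation supports. */
static enum mount_mode
check_features(const struct features *f, uint32_t supp_incompat,
    uint32_t supp_rocompat)
{
	assert(f != NULL);
	if (f->incompat & ~supp_incompat)
		return MOUNT_REFUSE;	/* unknown incompatible feature */
	if (f->rocompat & ~supp_rocompat)
		return MOUNT_READONLY;	/* unknown RO-compatible feature */
	return MOUNT_READWRITE;		/* unknown compat flags are harmless */
}
```

Checking the incompatible set first matters: a filesystem that cannot be read safely should be refused even if its RO-compatible flags are all known.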

	/* Ext3 JBD2 journaling. */
	u_int8_t   e2fs_journal_uuid[16];
	u_int32_t  e2fs_journal_ino;
	u_int32_t  e2fs_journal_dev;
	u_int32_t  e2fs_last_orphan;	/* start of list of inodes to delete */
	u_int32_t  e2fs_hash_seed[4];	/* htree hash seed */

So it turns out a journal has an identifier (a UUID), an inode number, and also a device number? Then comes the start of a list of inodes, which for some reason is a single u_int32_t, and a hash seed which is an array 💀

e2fs_last_orphan is a single u_int32_t because it only stores the head of the list of inodes to delete; the rest of the list is chained from there. The hash seed is an array of four u_int32_t, 128 bits in total, used to seed the hash function for htree (hashed directory) lookups.

	u_int8_t   e2fs_def_hash_version;
	u_int8_t   e2fs_journal_backup_type;
	u_int16_t  e2fs_gdesc_size;
	u_int32_t  e2fs_default_mount_opts;
	u_int32_t  e2fs_first_meta_bg;
	u_int32_t  e2fs_mkfs_time;

As stated before, information is scarce 😅 For now, we will do our best to pick apart what we see, and hopefully understand fully what's going on later in the code or once we get in touch with other developers.

The first value is the default hash version: it selects which hash algorithm is used for directory hashes, with e2fs_hash_seed above as its seed. This is not a cryptographic hash family like SHA; the kernel documentation lists options such as legacy, half MD4, and TEA.

The second value is the journal backup type. According to the kernel documentation, it indicates whether the e2fs_journal_backup field further down holds a backup copy of the journal inode's block pointers and size.

The group descriptor holds the information specific to each block group. Its size is stored here because it is not fixed: it depends on how much metadata each descriptor has to hold, which grows in 64-bit mode.

e2fs_first_meta_bg is the first metadata block group (the name is heavily abbreviated). It only matters when the meta_bg feature is enabled, which changes where the group descriptors are stored.

The last value holds when the filesystem was created (when mkfs was last run), which is useful metadata but doesn't appear to be directly useful to us, unless we are checking data integrity.

	u_int32_t  e2fs_journal_backup[17];

This is not where the logs are stored: 17 double words would be far too small for that. According to the kernel documentation, the field holds a backup copy of the journal inode's block pointer array (15 words) followed by its size (2 words), presumably so that the journal can still be found if the journal inode is damaged.

	u_int32_t  e2fs_bcount_hi;	/* high bits of blocks count */
	u_int32_t  e2fs_rbcount_hi;	/* high bits of reserved blocks count */
	u_int32_t  e2fs_fbcount_hi;	/* high bits of free blocks count */
	u_int16_t  e2fs_min_extra_isize; /* all inodes have some bytes */
	u_int16_t  e2fs_want_extra_isize;/* inodes must reserve some bytes */

While these have comments, they probably make no sense to the non kernel developer 🤣

At this point I think it is safe to say this is beyond the journaling part of the superblock, even though there is no padding or comment to indicate otherwise. The high bits are the more significant 32 bits of each count: in 64-bit mode, the full 64-bit value is the _hi field shifted left by 32 and combined with the regular 32-bit count 🤔 That is why they are u_int32_t like the regular counts, and why they only matter in 64-bit mode.
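Putting a low and a high half together into one 64-bit count can be sketched like this. The function name and the is_64bit parameter are assumptions for this example; in a real driver the decision would come from the 64BIT incompatible feature flag.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Combine e.g. e2fs_bcount (lo) and e2fs_bcount_hi (hi) into one value.
 * is_64bit says whether the filesystem runs in 64-bit mode.
 */
static uint64_t
combine_count(uint32_t lo, uint32_t hi, int is_64bit)
{
	assert(is_64bit == 0 || is_64bit == 1);
	if (!is_64bit)
		return lo;	/* high half is not meaningful here */
	return ((uint64_t)hi << 32) | lo;
}
```

The cast to uint64_t before shifting is essential: shifting a 32-bit value left by 32 is undefined behavior in C.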

The last two fields are quite confusing, comment-wise. However, all they mean is that there is a minimum number of extra bytes that every inode has, and a number that new inodes should ideally reserve.

	u_int32_t  e2fs_flags;		/* miscellaneous flags */
	u_int16_t  e2fs_raid_stride;	/* RAID stride */
	u_int16_t  e2fs_mmpintv;		/* seconds to wait in MMP checking */
	u_int64_t  e2fs_mmpblk;		/* block for multi-mount protection */
	u_int32_t  e2fs_raid_stripe_wid; /* blocks on data disks (N * stride) */

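These are other helpful fields. RAID means Redundant Array of Independent Disks. The RAID stride is the number of blocks written to one disk before moving on to the next. The last item, e2fs_raid_stripe_wid, is the number of blocks in one full stripe across all data disks: as the comment says, N times the stride, where N is the number of data disks. Neither value is a count of duplicate copies; they let the filesystem align its allocations with the array layout.

MMP is multi-mount protection, which guards against the filesystem being mounted by more than one host at once: e2fs_mmpblk is the block that holds the MMP data, and e2fs_mmpintv is the number of seconds to wait between checks.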

	u_int8_t   e2fs_log_gpf;		/* FLEX_BG group size */
	u_int8_t   e2fs_chksum_type;	/* metadata checksum algorithm used */
	u_int8_t   e2fs_encrypt;		/* versioning level for encryption */
	u_int8_t   e2fs_reserved_pad;

The values seem to get more and more disorganized as we reach the end of the block, presumably because new fields were added wherever reserved space was left.