Disk I/O
In this chapter, we will implement a device driver for the virtio-blk, a virtual disk device. While virtio-blk does not exist in real hardware, it shares the very same interface as a real one.
Virtio
Virtio is a device interface standard for virtual devices (virtio devices). In other words, it is one of the APIs for device drivers to control devices. Like you use HTTP to access web servers, you use virtio to access virtio devices. Virtio is widely used in virtualization environments such as QEMU and Firecracker.
NOTE
The latest Virtio specification defines 2 interfaces: Legacy and Modern. In this implementation, we use the legacy interface because it's slightly simpler and is not significantly different from the modern one.
Refer to the legacy PDF, search for sections starting with Legacy Interface: in the latest HTML.
Virtqueue
Virtio devices have a structure called a virtqueue. As the name suggests, it is a queue shared between the driver and the device. In a nutshell:
A virtqueue consists of the following three areas:
| Name | Written by | Content | Contents |
|---|---|---|---|
| Descriptor Table | Driver | A table of descriptors: the address and size of the request | Memory address, length, index of the next descriptor |
| Available Ring | Driver | Processing requests to the device | The head index of the descriptor chain |
| Used Ring | Device | Processing requests handled by the device | The head index of the descriptor chain |
Each request (e.g., a write to disk) consists of multiple descriptors, called a descriptor chain. By splitting into multiple descriptors, you can specify scattered memory data (so-called Scatter-Gather IO) or give different descriptor attributes (whether writable by the device).
For example, when writing to a disk, virtqueue will be used as follows:
- The driver writes a read/write request in the Descriptor Table.
- The driver adds the index of the head descriptor to the Available Ring.
- The driver notifies the device that there is a new request.
- The device reads a request from the Available Ring and processes it.
- The device writes the descriptor index to the Used Ring, and notifies the driver that it is complete.
For details, refer to the virtio specification. In this implementation, we will focus on a device called virtio-blk.
Enabling virtio devices
Before writing a device driver, let's prepare a test file. Create a file named lorem.txt and fill it with some random text like the following:
$ echo "Lorem ipsum dolor sit amet, consectetur adipiscing elit. In ut magna consequat, cursus velit aliquam, scelerisque odio. Ut lorem eros, feugiat quis bibendum vitae, malesuada ac orci. Praesent eget quam non nunc fringilla cursus imperdiet non tellus. Aenean dictum lobortis turpis, non interdum leo rhoncus sed. Cras in tellus auctor, faucibus tortor ut, maximus metus. Praesent placerat ut magna non tristique. Pellentesque at nunc quis dui tempor vulputate. Vestibulum vitae massa orci. Mauris et tellus quis risus sagittis placerat. Integer lorem leo, feugiat sed molestie non, viverra a tellus." > lorem.txtAlso, attach a virtio-blk device to QEMU:
$QEMU -machine virt -bios default -nographic -serial mon:stdio --no-reboot \
-d unimp,guest_errors,int,cpu_reset -D qemu.log \
-drive id=drive0,file=lorem.txt,format=raw,if=none \ # new
-device virtio-blk-device,drive=drive0,bus=virtio-mmio-bus.0 \ # new
-kernel kernel.elfThe newly added options are as follows:
-drive id=drive0: Defines disk nameddrive0, withlorem.txtas the disk image. The disk image format israw(treats the file contents as-is as disk data).-device virtio-blk-device: Adds a virtio-blk device with diskdrive0.bus=virtio-mmio-bus.0maps the device into a virtio-mmio bus (virtio over Memory Mapped I/O).
Define C macros/structs
First, let's add some virtio-related definitions to kernel.h:
#define SECTOR_SIZE 512
#define VIRTQ_ENTRY_NUM 16
#define VIRTIO_DEVICE_BLK 2
#define VIRTIO_BLK_PADDR 0x10001000
#define VIRTIO_REG_MAGIC 0x00
#define VIRTIO_REG_VERSION 0x04
#define VIRTIO_REG_DEVICE_ID 0x08
#define VIRTIO_REG_PAGE_SIZE 0x28
#define VIRTIO_REG_QUEUE_SEL 0x30
#define VIRTIO_REG_QUEUE_NUM_MAX 0x34
#define VIRTIO_REG_QUEUE_NUM 0x38
#define VIRTIO_REG_QUEUE_PFN 0x40
#define VIRTIO_REG_QUEUE_READY 0x44
#define VIRTIO_REG_QUEUE_NOTIFY 0x50
#define VIRTIO_REG_DEVICE_STATUS 0x70
#define VIRTIO_REG_DEVICE_CONFIG 0x100
#define VIRTIO_STATUS_ACK 1
#define VIRTIO_STATUS_DRIVER 2
#define VIRTIO_STATUS_DRIVER_OK 4
#define VIRTQ_DESC_F_NEXT 1
#define VIRTQ_DESC_F_WRITE 2
#define VIRTQ_AVAIL_F_NO_INTERRUPT 1
#define VIRTIO_BLK_T_IN 0
#define VIRTIO_BLK_T_OUT 1
// Virtqueue Descriptor Table entry.
struct virtq_desc {
uint64_t addr;
uint32_t len;
uint16_t flags;
uint16_t next;
} __attribute__((packed));
// Virtqueue Available Ring.
struct virtq_avail {
uint16_t flags;
uint16_t index;
uint16_t ring[VIRTQ_ENTRY_NUM];
} __attribute__((packed));
// Virtqueue Used Ring entry.
struct virtq_used_elem {
uint32_t id;
uint32_t len;
} __attribute__((packed));
// Virtqueue Used Ring.
struct virtq_used {
uint16_t flags;
uint16_t index;
struct virtq_used_elem ring[VIRTQ_ENTRY_NUM];
} __attribute__((packed));
// Virtqueue.
struct virtio_virtq {
struct virtq_desc descs[VIRTQ_ENTRY_NUM];
struct virtq_avail avail;
struct virtq_used used __attribute__((aligned(PAGE_SIZE)));
int queue_index;
volatile uint16_t *used_index;
uint16_t last_used_index;
} __attribute__((packed));
// Virtio-blk request.
struct virtio_blk_req {
uint32_t type;
uint32_t reserved;
uint64_t sector;
uint8_t data[512];
uint8_t status;
} __attribute__((packed));NOTE
__attribute__((packed)) is a compiler extension that tells the compiler to pack the struct members without padding. Otherwise, the compiler may add hidden padding bytes and driver/device may see different values.
Next, add utility functions to kernel.c for accessing MMIO registers:
uint32_t virtio_reg_read32(unsigned offset) {
return *((volatile uint32_t *) (VIRTIO_BLK_PADDR + offset));
}
uint64_t virtio_reg_read64(unsigned offset) {
return *((volatile uint64_t *) (VIRTIO_BLK_PADDR + offset));
}
void virtio_reg_write32(unsigned offset, uint32_t value) {
*((volatile uint32_t *) (VIRTIO_BLK_PADDR + offset)) = value;
}
void virtio_reg_fetch_and_or32(unsigned offset, uint32_t value) {
virtio_reg_write32(offset, virtio_reg_read32(offset) | value);
}WARNING
Accessing MMIO registers are not same as accessing normal memory. You should use volatile keyword to prevent the compiler from optimizing out the read/write operations. In MMIO, memory access may trigger side effects (e.g., sending a command to the device).
Map the MMIO region
First, map the virtio-blk MMIO region to the page table so that the kernel can access the MMIO registers. It's super simple:
struct process *create_process(const void *image, size_t image_size) {
/* omitted */
for (paddr_t paddr = (paddr_t) __kernel_base;
paddr < (paddr_t) __free_ram_end; paddr += PAGE_SIZE)
map_page(page_table, paddr, paddr, PAGE_R | PAGE_W | PAGE_X);
map_page(page_table, VIRTIO_BLK_PADDR, VIRTIO_BLK_PADDR, PAGE_R | PAGE_W); // newVirtio device initialization
The initialization process is described in the spec as follows:
- Reset the device. This is not required on initial start up.
- The ACKNOWLEDGE status bit is set: we have noticed the device.
- The DRIVER status bit is set: we know how to drive the device.
- Device-specific setup, including reading the Device Feature Bits, discovery of virtqueues for the device, optional MSI-X setup, and reading and possibly writing the virtio configuration space.
- The subset of Device Feature Bits understood by the driver is written to the device.
- The DRIVER_OK status bit is set.
You might be overwhelmed by lengthy steps, but don't worry. A naive implementation is very simple:
struct virtio_virtq *blk_request_vq;
struct virtio_blk_req *blk_req;
paddr_t blk_req_paddr;
uint64_t blk_capacity;
void virtio_blk_init(void) {
if (virtio_reg_read32(VIRTIO_REG_MAGIC) != 0x74726976)
PANIC("virtio: invalid magic value");
if (virtio_reg_read32(VIRTIO_REG_VERSION) != 1)
PANIC("virtio: invalid version");
if (virtio_reg_read32(VIRTIO_REG_DEVICE_ID) != VIRTIO_DEVICE_BLK)
PANIC("virtio: invalid device id");
// 1. Reset the device.
virtio_reg_write32(VIRTIO_REG_DEVICE_STATUS, 0);
// 2. Set the ACKNOWLEDGE status bit: We found the device.
virtio_reg_fetch_and_or32(VIRTIO_REG_DEVICE_STATUS, VIRTIO_STATUS_ACK);
// 3. Set the DRIVER status bit: We know how to use the device.
virtio_reg_fetch_and_or32(VIRTIO_REG_DEVICE_STATUS, VIRTIO_STATUS_DRIVER);
// Set our page size: We use 4KB pages. This defines PFN (page frame number) calculation.
virtio_reg_write32(VIRTIO_REG_PAGE_SIZE, PAGE_SIZE);
// Initialize a queue for disk read/write requests.
blk_request_vq = virtq_init(0);
// 6. Set the DRIVER_OK status bit: We can now use the device!
virtio_reg_write32(VIRTIO_REG_DEVICE_STATUS, VIRTIO_STATUS_DRIVER_OK);
// Get the disk capacity.
blk_capacity = virtio_reg_read64(VIRTIO_REG_DEVICE_CONFIG + 0) * SECTOR_SIZE;
printf("virtio-blk: capacity is %d bytes\n", (int)blk_capacity);
// Allocate a region to store requests to the device.
blk_req_paddr = alloc_pages(align_up(sizeof(*blk_req), PAGE_SIZE) / PAGE_SIZE);
blk_req = (struct virtio_blk_req *) blk_req_paddr;
}void kernel_main(void) {
memset(__bss, 0, (size_t) __bss_end - (size_t) __bss);
WRITE_CSR(stvec, (uint32_t) kernel_entry);
virtio_blk_init(); // newThis is a typical initialization pattern for device drivers. Reset the device, set parameters, then enable the device. We, operating systems, do not have to care about what actually happens inside the device. Just do few memory read/write operations like above.
Virtqueue initialization
A virtqueue should be initialized as follows:
- Write the virtqueue index (first queue is 0) to the Queue Select field.
- Read the virtqueue size from the Queue Size field, which is always a power of 2. This controls how big the virtqueue is (see below). If this field is 0, the virtqueue does not exist.
- Allocate and zero virtqueue in contiguous physical memory, on a 4096 byte alignment. Write the physical address, divided by 4096 to the Queue Address field.
Here's a simple implementation:
struct virtio_virtq *virtq_init(unsigned index) {
// Allocate a region for the virtqueue.
paddr_t virtq_paddr = alloc_pages(align_up(sizeof(struct virtio_virtq), PAGE_SIZE) / PAGE_SIZE);
struct virtio_virtq *vq = (struct virtio_virtq *) virtq_paddr;
vq->queue_index = index;
vq->used_index = (volatile uint16_t *) &vq->used.index;
// Select the queue: Write the virtqueue index (first queue is 0).
virtio_reg_write32(VIRTIO_REG_QUEUE_SEL, index);
// Specify the queue size: Write the # of descriptors we'll use.
virtio_reg_write32(VIRTIO_REG_QUEUE_NUM, VIRTQ_ENTRY_NUM);
// Write the physical page frame number (not physical address!) of the queue.
virtio_reg_write32(VIRTIO_REG_QUEUE_PFN, virtq_paddr / PAGE_SIZE);
return vq;
}This function allocates a memory region for a virtqueue, and tells the its physical address to the device. The device will use this memory region to read/write requests.
TIP
What drivers do in the initialization process is to check device capabilities/features, allocating OS resources (e.g., memory regions), and setting parameters. Isn't it similar to handshakes in network protocols?
Sending I/O requests
We now have an initialized virtio-blk device. Let's send an I/O request to the disk. I/O requests to the disk is implemented by "adding processing requests to the virtqueue" as follows:
// Notifies the device that there is a new request. `desc_index` is the index
// of the head descriptor of the new request.
void virtq_kick(struct virtio_virtq *vq, int desc_index) {
vq->avail.ring[vq->avail.index % VIRTQ_ENTRY_NUM] = desc_index;
vq->avail.index++;
__sync_synchronize();
virtio_reg_write32(VIRTIO_REG_QUEUE_NOTIFY, vq->queue_index);
vq->last_used_index++;
}
// Returns whether there are requests being processed by the device.
bool virtq_is_busy(struct virtio_virtq *vq) {
return vq->last_used_index != *vq->used_index;
}
// Reads/writes from/to virtio-blk device.
void read_write_disk(void *buf, unsigned sector, int is_write) {
if (sector >= blk_capacity / SECTOR_SIZE) {
printf("virtio: tried to read/write sector=%d, but capacity is %d\n",
sector, blk_capacity / SECTOR_SIZE);
return;
}
// Construct the request according to the virtio-blk specification.
blk_req->sector = sector;
blk_req->type = is_write ? VIRTIO_BLK_T_OUT : VIRTIO_BLK_T_IN;
if (is_write)
memcpy(blk_req->data, buf, SECTOR_SIZE);
// Construct the virtqueue descriptors (using 3 descriptors).
struct virtio_virtq *vq = blk_request_vq;
vq->descs[0].addr = blk_req_paddr;
vq->descs[0].len = sizeof(uint32_t) * 2 + sizeof(uint64_t);
vq->descs[0].flags = VIRTQ_DESC_F_NEXT;
vq->descs[0].next = 1;
vq->descs[1].addr = blk_req_paddr + offsetof(struct virtio_blk_req, data);
vq->descs[1].len = SECTOR_SIZE;
vq->descs[1].flags = VIRTQ_DESC_F_NEXT | (is_write ? 0 : VIRTQ_DESC_F_WRITE);
vq->descs[1].next = 2;
vq->descs[2].addr = blk_req_paddr + offsetof(struct virtio_blk_req, status);
vq->descs[2].len = sizeof(uint8_t);
vq->descs[2].flags = VIRTQ_DESC_F_WRITE;
// Notify the device that there is a new request.
virtq_kick(vq, 0);
// Wait until the device finishes processing.
while (virtq_is_busy(vq))
;
// virtio-blk: If a non-zero value is returned, it's an error.
if (blk_req->status != 0) {
printf("virtio: warn: failed to read/write sector=%d status=%d\n",
sector, blk_req->status);
return;
}
// For read operations, copy the data into the buffer.
if (!is_write)
memcpy(buf, blk_req->data, SECTOR_SIZE);
}A request is sent in the following steps:
- Construct a request in
blk_req. Specify the sector number you want to access and the type of read/write. - Construct a descriptor chain pointing to each area of
blk_req(see below). - Add the index of the head descriptor of the descriptor chain to the Available Ring.
- Notify the device that there is a new pending request.
- Wait until the device finishes processing (aka busy-waiting or polling).
- Check the response from the device.
Here, we construct a descriptor chain consisting of 3 descriptors. We need 3 descriptors because each descriptor has different attributes (flags) as follows:
struct virtio_blk_req {
// First descriptor: read-only from the device
uint32_t type;
uint32_t reserved;
uint64_t sector;
// Second descriptor: writable by the device if it's a read operation (VIRTQ_DESC_F_WRITE)
uint8_t data[512];
// Third descriptor: writable by the device (VIRTQ_DESC_F_WRITE)
uint8_t status;
} __attribute__((packed));Because we busy-wait until the processing is complete every time, we can simply use the first 3 descriptors in the ring. However, in practice, you need to track free/used descriptors to process multiple requests simultaneously.
Try it out
Lastly, let's try disk I/O. Add the following code to kernel.c:
virtio_blk_init();
char buf[SECTOR_SIZE];
read_write_disk(buf, 0, false /* read from the disk */);
printf("first sector: %s\n", buf);
strcpy(buf, "hello from kernel!!!\n");
read_write_disk(buf, 0, true /* write to the disk */);Since we specify lorem.txt as the (raw) disk image, its contents should be displayed as-is:
$ ./run.sh
virtio-blk: capacity is 1024 bytes
first sector: Lorem ipsum dolor sit amet, consectetur adipiscing elit ...Also, the first sector is overwritten with the string "hello from kernel!!!":
$ head lorem.txt
hello from kernel!!!
amet, consectetur adipiscing elit ...Congratulations! You've successfully implemented a disk I/O driver!
TIP
As you might notice, device drivers are just "glue" between the OS and devices. Drivers don't control the hardware directly; drivers communicate with other software running on the device (e.g., firmware). Devices and their software, not the OS driver, will do the rest of the heavy lifting, like moving disk read/write heads.