Disk I/O
In this chapter, we will implement a device driver for the virtio-blk, a virtual disk device. While virtio-blk does not exist in real hardware, it shares the very same interface as a real one.
Virtio
Virtio is a device interface standard for virtual devices (virtio devices). In other words, it is one of APIs for device drivers to control devices. Like you use HTTP to access web servers, you use virtio to access virtio devices. Virtio is widely used in virtualization environments such as QEMU and Firecracker.
Virtqueue
Virtio devices have a structure called a virtqueue. As the name suggests, it is a queue shared between the driver and the device. In a nutshell:
A virtqueue consists of the following three areas:
Name | Written by | Content | Contents |
---|---|---|---|
Descriptor Area | Driver | A Table of descriptors: the address and size of the request | Memory address, length, index of the next descriptor |
Available Ring | Driver | Processing requests to the device | The head index of the descriptor chain |
Used Ring | Device | Processing requests handled by the device | The head index of the descriptor chain |
Each request (e.g., a write to disk) consists of multiple descriptors, called a descriptor chain. By splitting into multiple descriptors, you can specify scattered memory data (so-called Scatter-Gather IO) or give different descriptor attributes (whether writable by the device).
For example, when writing to a disk, virtqueue will be used as follows:
- The driver writes a read/write request in the Descriptor area.
- The driver adds the index of the head descriptor to the Available Ring.
- The driver notifies the device that there is a new request.
- The device reads a request from the Available Ring and processes it.
- The device writes the descriptor index to the Used Ring, and notifies the driver that it is complete.
For details, refer to the virtio specification. In this implementation, we will focus on a device called virtio-blk.
Enabling virtio devices
Before writing a device driver, let's prepare a test file. Create a file named lorem.txt
and fill it with some random text like the following:
$ echo "Lorem ipsum dolor sit amet, consectetur adipiscing elit. In ut magna consequat, cursus velit aliquam, scelerisque odio. Ut lorem eros, feugiat quis bibendum vitae, malesuada ac orci. Praesent eget quam non nunc fringilla cursus imperdiet non tellus. Aenean dictum lobortis turpis, non interdum leo rhoncus sed. Cras in tellus auctor, faucibus tortor ut, maximus metus. Praesent placerat ut magna non tristique. Pellentesque at nunc quis dui tempor vulputate. Vestibulum vitae massa orci. Mauris et tellus quis risus sagittis placerat. Integer lorem leo, feugiat sed molestie non, viverra a tellus." > lorem.txt
Also, attach a virtio-blk device to QEMU:
$QEMU -machine virt -bios default -nographic -serial mon:stdio --no-reboot \
-d unimp,guest_errors,int,cpu_reset -D qemu.log \
-drive id=drive0,file=lorem.txt,format=raw,if=none \ # new
-device virtio-blk-device,drive=drive0,bus=virtio-mmio-bus.0 \ # new
-kernel kernel.elf
The newly added options are as follows:
-drive id=drive0
: Defines disk nameddrive0
, withlorem.txt
as the disk image. The disk image format israw
(treats the file contents as-is as disk data).-device virtio-blk-device
: Adds a virtio-blk device with diskdrive0
.bus=virtio-mmio-bus.0
maps the device into a virtio-mmio bus (virtio over Memory Mapped I/O).
Define C macros/structs
First, let's add some virtio-related definitions to kernel.h
:
#define SECTOR_SIZE 512
#define VIRTQ_ENTRY_NUM 16
#define VIRTIO_DEVICE_BLK 2
#define VIRTIO_BLK_PADDR 0x10001000
#define VIRTIO_REG_MAGIC 0x00
#define VIRTIO_REG_VERSION 0x04
#define VIRTIO_REG_DEVICE_ID 0x08
#define VIRTIO_REG_QUEUE_SEL 0x30
#define VIRTIO_REG_QUEUE_NUM_MAX 0x34
#define VIRTIO_REG_QUEUE_NUM 0x38
#define VIRTIO_REG_QUEUE_ALIGN 0x3c
#define VIRTIO_REG_QUEUE_PFN 0x40
#define VIRTIO_REG_QUEUE_READY 0x44
#define VIRTIO_REG_QUEUE_NOTIFY 0x50
#define VIRTIO_REG_DEVICE_STATUS 0x70
#define VIRTIO_REG_DEVICE_CONFIG 0x100
#define VIRTIO_STATUS_ACK 1
#define VIRTIO_STATUS_DRIVER 2
#define VIRTIO_STATUS_DRIVER_OK 4
#define VIRTIO_STATUS_FEAT_OK 8
#define VIRTQ_DESC_F_NEXT 1
#define VIRTQ_DESC_F_WRITE 2
#define VIRTQ_AVAIL_F_NO_INTERRUPT 1
#define VIRTIO_BLK_T_IN 0
#define VIRTIO_BLK_T_OUT 1
// Virtqueue Descriptor area entry.
struct virtq_desc {
uint64_t addr;
uint32_t len;
uint16_t flags;
uint16_t next;
} __attribute__((packed));
// Virtqueue Available Ring.
struct virtq_avail {
uint16_t flags;
uint16_t index;
uint16_t ring[VIRTQ_ENTRY_NUM];
} __attribute__((packed));
// Virtqueue Used Ring entry.
struct virtq_used_elem {
uint32_t id;
uint32_t len;
} __attribute__((packed));
// Virtqueue Used Ring.
struct virtq_used {
uint16_t flags;
uint16_t index;
struct virtq_used_elem ring[VIRTQ_ENTRY_NUM];
} __attribute__((packed));
// Virtqueue.
struct virtio_virtq {
struct virtq_desc descs[VIRTQ_ENTRY_NUM];
struct virtq_avail avail;
struct virtq_used used __attribute__((aligned(PAGE_SIZE)));
int queue_index;
volatile uint16_t *used_index;
uint16_t last_used_index;
} __attribute__((packed));
// Virtio-blk request.
struct virtio_blk_req {
uint32_t type;
uint32_t reserved;
uint64_t sector;
uint8_t data[512];
uint8_t status;
} __attribute__((packed));
NOTE
__attribute__((packed))
is a compiler extension that tells the compiler to pack the struct members without padding. Otherwise, the compiler may add hidden padding bytes and driver/device may see different values.
Next, add utility functions to kernel.c
for accessing MMIO registers:
uint32_t virtio_reg_read32(unsigned offset) {
return *((volatile uint32_t *) (VIRTIO_BLK_PADDR + offset));
}
uint64_t virtio_reg_read64(unsigned offset) {
return *((volatile uint64_t *) (VIRTIO_BLK_PADDR + offset));
}
void virtio_reg_write32(unsigned offset, uint32_t value) {
*((volatile uint32_t *) (VIRTIO_BLK_PADDR + offset)) = value;
}
void virtio_reg_fetch_and_or32(unsigned offset, uint32_t value) {
virtio_reg_write32(offset, virtio_reg_read32(offset) | value);
}
WARNING
Accessing MMIO registers are not same as accessing normal memory. You should use volatile
keyword to prevent the compiler from optimizing out the read/write operations. In MMIO, memory access may trigger side effects (e.g., sending a command to the device).
Map the MMIO region
First, map the virtio-blk
MMIO region to the page table so that the kernel can access the MMIO registers. It's super simple:
struct process *create_process(const void *image, size_t image_size) {
/* omitted */
for (paddr_t paddr = (paddr_t) __kernel_base;
paddr < (paddr_t) __free_ram_end; paddr += PAGE_SIZE)
map_page(page_table, paddr, paddr, PAGE_R | PAGE_W | PAGE_X);
map_page(page_table, VIRTIO_BLK_PADDR, VIRTIO_BLK_PADDR, PAGE_R | PAGE_W); // new
Virtio device initialization
The initialization process is detailed in the virtio specification:
3.1.1 Driver Requirements: Device Initialization The driver MUST follow this sequence to initialize a device:
- Reset the device.
- Set the ACKNOWLEDGE status bit: the guest OS has noticed the device.
- Set the DRIVER status bit: the guest OS knows how to drive the device.
- Read device feature bits, and write the subset of feature bits understood by the OS and driver to the device. During this step the driver MAY read (but MUST NOT write) the device-specific configuration fields to check that it can support the device before accepting it.
- Set the FEATURES_OK status bit. The driver MUST NOT accept new feature bits after this step.
- Re-read device status to ensure the FEATURES_OK bit is still set: otherwise, the device does not support our subset of features and the device is unusable.
- Perform device-specific setup, including discovery of virtqueues for the device, optional per-bus setup, reading and possibly writing the device’s virtio configuration space, and population of virtqueues.
- Set the DRIVER_OK status bit. At this point the device is “live”.
You might be overwhelmed by lengthy steps, but don't worry. A naive implementation is very simple:
struct virtio_virtq *blk_request_vq;
struct virtio_blk_req *blk_req;
paddr_t blk_req_paddr;
unsigned blk_capacity;
void virtio_blk_init(void) {
if (virtio_reg_read32(VIRTIO_REG_MAGIC) != 0x74726976)
PANIC("virtio: invalid magic value");
if (virtio_reg_read32(VIRTIO_REG_VERSION) != 1)
PANIC("virtio: invalid version");
if (virtio_reg_read32(VIRTIO_REG_DEVICE_ID) != VIRTIO_DEVICE_BLK)
PANIC("virtio: invalid device id");
// 1. Reset the device.
virtio_reg_write32(VIRTIO_REG_DEVICE_STATUS, 0);
// 2. Set the ACKNOWLEDGE status bit: the guest OS has noticed the device.
virtio_reg_fetch_and_or32(VIRTIO_REG_DEVICE_STATUS, VIRTIO_STATUS_ACK);
// 3. Set the DRIVER status bit.
virtio_reg_fetch_and_or32(VIRTIO_REG_DEVICE_STATUS, VIRTIO_STATUS_DRIVER);
// 5. Set the FEATURES_OK status bit.
virtio_reg_fetch_and_or32(VIRTIO_REG_DEVICE_STATUS, VIRTIO_STATUS_FEAT_OK);
// 7. Perform device-specific setup, including discovery of virtqueues for the device
blk_request_vq = virtq_init(0);
// 8. Set the DRIVER_OK status bit.
virtio_reg_write32(VIRTIO_REG_DEVICE_STATUS, VIRTIO_STATUS_DRIVER_OK);
// Get the disk capacity.
blk_capacity = virtio_reg_read64(VIRTIO_REG_DEVICE_CONFIG + 0) * SECTOR_SIZE;
printf("virtio-blk: capacity is %d bytes\n", blk_capacity);
// Allocate a region to store requests to the device.
blk_req_paddr = alloc_pages(align_up(sizeof(*blk_req), PAGE_SIZE) / PAGE_SIZE);
blk_req = (struct virtio_blk_req *) blk_req_paddr;
}
void kernel_main(void) {
memset(__bss, 0, (size_t) __bss_end - (size_t) __bss);
WRITE_CSR(stvec, (uint32_t) kernel_entry);
virtio_blk_init(); // new
Virtqueue initialization
Virtqueues also need to be initialized. Let's read the specification:
The virtual queue is configured as follows:
- Select the queue writing its index (first queue is 0) to QueueSel.
- Check if the queue is not already in use: read QueuePFN, expecting a returned value of zero (0x0).
- Read maximum queue size (number of elements) from QueueNumMax. If the returned value is zero (0x0) the queue is not available.
- Allocate and zero the queue pages in contiguous virtual memory, aligning the Used Ring to an optimal boundary (usually page size). The driver should choose a queue size smaller than or equal to QueueNumMax.
- Notify the device about the queue size by writing the size to QueueNum.
- Notify the device about the used alignment by writing its value in bytes to QueueAlign.
- Write the physical number of the first page of the queue to the QueuePFN register.
Here's a simple implementation:
struct virtio_virtq *virtq_init(unsigned index) {
// Allocate a region for the virtqueue.
paddr_t virtq_paddr = alloc_pages(align_up(sizeof(struct virtio_virtq), PAGE_SIZE) / PAGE_SIZE);
struct virtio_virtq *vq = (struct virtio_virtq *) virtq_paddr;
vq->queue_index = index;
vq->used_index = (volatile uint16_t *) &vq->used.index;
// 1. Select the queue writing its index (first queue is 0) to QueueSel.
virtio_reg_write32(VIRTIO_REG_QUEUE_SEL, index);
// 5. Notify the device about the queue size by writing the size to QueueNum.
virtio_reg_write32(VIRTIO_REG_QUEUE_NUM, VIRTQ_ENTRY_NUM);
// 6. Notify the device about the used alignment by writing its value in bytes to QueueAlign.
virtio_reg_write32(VIRTIO_REG_QUEUE_ALIGN, 0);
// 7. Write the physical number of the first page of the queue to the QueuePFN register.
virtio_reg_write32(VIRTIO_REG_QUEUE_PFN, virtq_paddr);
return vq;
}
This function allocates a memory region for a virtqueue, and tells the its physical address to the device. The device will use this memory region to read/write requests.
TIP
What drivers do in the initialization process is to check device capabilities/features, allocating OS resources (e.g., memory regions), and setting parameters. Isn't it similar to handshakes in network protocols?
Sending I/O requests
We now have an initialized virtio-blk device. Let's send an I/O request to the disk. I/O requests to the disk is implemented by "adding processing requests to the virtqueue" as follows:
// Notifies the device that there is a new request. `desc_index` is the index
// of the head descriptor of the new request.
void virtq_kick(struct virtio_virtq *vq, int desc_index) {
vq->avail.ring[vq->avail.index % VIRTQ_ENTRY_NUM] = desc_index;
vq->avail.index++;
__sync_synchronize();
virtio_reg_write32(VIRTIO_REG_QUEUE_NOTIFY, vq->queue_index);
vq->last_used_index++;
}
// Returns whether there are requests being processed by the device.
bool virtq_is_busy(struct virtio_virtq *vq) {
return vq->last_used_index != *vq->used_index;
}
// Reads/writes from/to virtio-blk device.
void read_write_disk(void *buf, unsigned sector, int is_write) {
if (sector >= blk_capacity / SECTOR_SIZE) {
printf("virtio: tried to read/write sector=%d, but capacity is %d\n",
sector, blk_capacity / SECTOR_SIZE);
return;
}
// Construct the request according to the virtio-blk specification.
blk_req->sector = sector;
blk_req->type = is_write ? VIRTIO_BLK_T_OUT : VIRTIO_BLK_T_IN;
if (is_write)
memcpy(blk_req->data, buf, SECTOR_SIZE);
// Construct the virtqueue descriptors (using 3 descriptors).
struct virtio_virtq *vq = blk_request_vq;
vq->descs[0].addr = blk_req_paddr;
vq->descs[0].len = sizeof(uint32_t) * 2 + sizeof(uint64_t);
vq->descs[0].flags = VIRTQ_DESC_F_NEXT;
vq->descs[0].next = 1;
vq->descs[1].addr = blk_req_paddr + offsetof(struct virtio_blk_req, data);
vq->descs[1].len = SECTOR_SIZE;
vq->descs[1].flags = VIRTQ_DESC_F_NEXT | (is_write ? 0 : VIRTQ_DESC_F_WRITE);
vq->descs[1].next = 2;
vq->descs[2].addr = blk_req_paddr + offsetof(struct virtio_blk_req, status);
vq->descs[2].len = sizeof(uint8_t);
vq->descs[2].flags = VIRTQ_DESC_F_WRITE;
// Notify the device that there is a new request.
virtq_kick(vq, 0);
// Wait until the device finishes processing.
while (virtq_is_busy(vq))
;
// virtio-blk: If a non-zero value is returned, it's an error.
if (blk_req->status != 0) {
printf("virtio: warn: failed to read/write sector=%d status=%d\n",
sector, blk_req->status);
return;
}
// For read operations, copy the data into the buffer.
if (!is_write)
memcpy(buf, blk_req->data, SECTOR_SIZE);
}
A request is sent in the following steps:
- Construct a request in
blk_req
. Specify the sector number you want to access and the type of read/write. - Construct a descriptor chain pointing to each area of
blk_req
(see below). - Add the index of the head descriptor of the descriptor chain to the Available Ring.
- Notify the device that there is a new pending request.
- Wait until the device finishes processing (aka busy-waiting or polling).
- Check the response from the device.
Here, we construct a descriptor chain consisting of 3 descriptors. We need 3 descriptors because each descriptor has different attributes (flags
) as follows:
struct virtio_blk_req {
// First descriptor: read-only from the device
uint32_t type;
uint32_t reserved;
uint64_t sector;
// Second descriptor: writable by the device if it's a read operation (VIRTQ_DESC_F_WRITE)
uint8_t data[512];
// Third descriptor: writable by the device (VIRTQ_DESC_F_WRITE)
uint8_t status;
} __attribute__((packed));
Because we busy-wait until the processing is complete every time, we can simply use the first 3 descriptors in the ring. However, in practice, you need to track free/used descriptors to process multiple requests simultaneously.
Try it out
Lastly, let's try disk I/O. Add the following code to kernel.c
:
virtio_blk_init();
char buf[SECTOR_SIZE];
read_write_disk(buf, 0, false /* read from the disk */);
printf("first sector: %s\n", buf);
strcpy(buf, "hello from kernel!!!\n");
read_write_disk(buf, 0, true /* write to the disk */);
Since we specify lorem.txt
as the (raw) disk image, its contents should be displayed as-is:
$ ./run.sh
virtio-blk: capacity is 1024 bytes
first sector: Lorem ipsum dolor sit amet, consectetur adipiscing elit ...
Also, the first sector is overwritten with the string "hello from kernel!!!":
$ head lorem.txt
hello from kernel!!!
amet, consectetur adipiscing elit ...
Congratulations! You've successfully implemented a disk I/O driver!
TIP
As you would notice, device drivers are just a "glue" between the OS and the device's. Device will do the rest of all the heavy lifting, like moving disk read/write heads. Drivers communicate with another software running on the device (e.g., firmware), not controlling the hardware directly.