block, bfq: update to latest bfq-v8-v4.4 state
BFQ-v8r12 up to 887cf43acdb1d5415fa678e4a63be8fe1bab2d3a

Change-Id: I4725397969026ff9fa969d598c4378f24800c31d
Signed-off-by: Alexander Martinz <alex@amartinz.at>

parent 6b7d5107ba
commit 833ce657e4

7 changed files with 4706 additions and 2280 deletions
Documentation/block/00-INDEX
@@ -1,5 +1,7 @@
00-INDEX
	- This file
bfq-iosched.txt
	- BFQ IO scheduler and its tunables
biodoc.txt
	- Notes on the Generic Block Layer Rewrite in Linux 2.5
capability.txt
545	Documentation/block/bfq-iosched.txt (new file)
@@ -0,0 +1,545 @@
BFQ (Budget Fair Queueing)
==========================

BFQ is a proportional-share I/O scheduler, with some extra
low-latency capabilities. In addition to cgroups support (blkio or io
controllers), BFQ's main features are:
- BFQ guarantees a high system and application responsiveness, and a
  low latency for time-sensitive applications, such as audio or video
  players;
- BFQ distributes bandwidth, and not just time, among processes or
  groups (switching back to time distribution when needed to keep
  throughput high).

In its default configuration, BFQ privileges latency over
throughput. So, when needed for achieving a lower latency, BFQ builds
schedules that may lead to a lower throughput. If your main or only
goal, for a given device, is to achieve the maximum-possible
throughput at all times, then do switch off all low-latency heuristics
for that device, by setting low_latency to 0. Full details in Section 3.
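
As an illustration, a minimal C sketch along these lines could switch
the heuristics off for a device named sda (the path below assumes the
usual per-device sysfs layout for legacy I/O-scheduler tunables; adapt
the device name to your system):

  /* Illustrative sketch: disable BFQ's low-latency heuristics on sda. */
  #include <stdio.h>

  int main(void)
  {
	FILE *f = fopen("/sys/block/sda/queue/iosched/low_latency", "w");

	if (!f) {
		perror("low_latency");
		return 1;
	}
	fputs("0\n", f);	/* 0 = pure throughput, 1 = low latency (default) */
	return fclose(f) ? 1 : 0;
  }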

On average CPUs, the current version of BFQ can handle devices
performing at most ~30K IOPS; at most ~50 KIOPS on faster CPUs. As a
reference, 30-50 KIOPS correspond to very high bandwidths with
sequential I/O (e.g., 8-12 GB/s if I/O requests are 256 KB large), and
to 120-200 MB/s with 4KB random I/O.

The table of contents follows. Impatient readers can jump straight to
Section 3.

CONTENTS

1. When may BFQ be useful?
 1-1 Personal systems
 1-2 Server systems
2. How does BFQ work?
3. What are BFQ's tunables?
4. BFQ group scheduling
 4-1 Service guarantees provided
 4-2 Interface

1. When may BFQ be useful?
==========================

BFQ provides the following benefits on personal and server systems.

1-1 Personal systems
--------------------

Low latency for interactive applications

Regardless of the actual background workload, BFQ guarantees that, for
interactive tasks, the storage device is virtually as responsive as if
it were idle. For example, even if one or more of the following
background workloads are being executed:
- one or more large files are being read, written or copied,
- a tree of source files is being compiled,
- one or more virtual machines are performing I/O,
- a software update is in progress,
- indexing daemons are scanning filesystems and updating their
  databases,
starting an application or loading a file from within an application
takes about the same time as if the storage device were idle. As a
comparison, with CFQ, NOOP or DEADLINE, and in the same conditions,
applications experience high latencies, or even become unresponsive
until the background workload terminates (also on SSDs).

Low latency for soft real-time applications

Soft real-time applications, such as audio and video
players/streamers, also enjoy a low latency and a low drop rate,
regardless of the background I/O workload. As a consequence, these
applications suffer almost no glitches due to the background workload.

Higher speed for code-development tasks

If some additional workload happens to be executed in parallel, then
BFQ executes the I/O-related components of typical code-development
tasks (compilation, checkout, merge, ...) much more quickly than CFQ,
NOOP or DEADLINE.

High throughput

On hard disks, BFQ achieves up to 30% higher throughput than CFQ, and
up to 150% higher throughput than DEADLINE and NOOP, with all the
sequential workloads considered in our tests. With random workloads,
and with all the workloads on flash-based devices, BFQ achieves,
instead, about the same throughput as the other schedulers.

Strong fairness, bandwidth and delay guarantees

BFQ distributes the device throughput, and not just the device time,
among I/O-bound applications in proportion to their weights, with any
workload and regardless of the device parameters. From these bandwidth
guarantees, it is possible to compute tight per-I/O-request delay
guarantees by a simple formula. If not configured for strict service
guarantees, BFQ switches to time-based resource sharing (only) for
applications that would otherwise cause a throughput loss.

1-2 Server systems
------------------

Most benefits for server systems follow from the same service
properties as above. In particular, regardless of whether additional,
possibly heavy workloads are being served, BFQ guarantees:

. audio and video-streaming with zero or very low jitter and drop
  rate;

. fast retrieval of web pages and embedded objects;

. real-time recording of data in live-dumping applications (e.g.,
  packet logging);

. responsiveness in local and remote access to a server.


2. How does BFQ work?
=====================

BFQ is a proportional-share I/O scheduler, whose general structure,
plus a lot of code, are borrowed from CFQ.

- Each process doing I/O on a device is associated with a weight and a
  (bfq_)queue.

- BFQ grants exclusive access to the device, for a while, to one queue
  (process) at a time, and implements this service model by
  associating every queue with a budget, measured in number of
  sectors.

- After a queue is granted access to the device, the budget of the
  queue is decremented, on each request dispatch, by the size of the
  request.

- The in-service queue is expired, i.e., its service is suspended,
  only if one of the following events occurs: 1) the queue finishes
  its budget, 2) the queue empties, 3) a "budget timeout" fires.

  - The budget timeout prevents processes doing random I/O from
    holding the device for too long and dramatically reducing
    throughput.

  - Actually, as in CFQ, a queue associated with a process issuing
    sync requests may not be expired immediately when it empties. In
    contrast, BFQ may idle the device for a short time interval,
    giving the process the chance to go on being served if it issues
    a new request in time. Device idling typically boosts the
    throughput on rotational devices, if processes do synchronous
    and sequential I/O. In addition, under BFQ, device idling is
    also instrumental in guaranteeing the desired throughput
    fraction to processes issuing sync requests (see the description
    of the slice_idle tunable in this document, or [1, 2], for more
    details).

    - With respect to idling for service guarantees, if several
      processes are competing for the device at the same time, but
      all processes (and groups, after the following commit) have
      the same weight, then BFQ guarantees the expected throughput
      distribution without ever idling the device. Throughput is
      thus as high as possible in this common scenario.

- If low-latency mode is enabled (default configuration), BFQ
  executes some special heuristics to detect interactive and soft
  real-time applications (e.g., video or audio players/streamers),
  and to reduce their latency. The most important action taken to
  achieve this goal is to give the queues associated with these
  applications more than their fair share of the device
  throughput. For brevity, we call "weight-raising" the whole set
  of actions taken by BFQ to privilege these queues. In
  particular, BFQ provides a milder form of weight-raising for
  interactive applications, and a stronger form for soft real-time
  applications.

- BFQ automatically deactivates idling for queues born in a burst of
  queue creations. In fact, these queues are usually associated with
  the processes of applications and services that benefit mostly
  from a high throughput. Examples are systemd during boot, or git
  grep.

- As in CFQ, BFQ merges queues performing interleaved I/O, i.e.,
  performing random I/O that becomes mostly sequential if
  merged. Differently from CFQ, BFQ achieves this goal with a more
  reactive mechanism, called Early Queue Merge (EQM). EQM is so
  responsive in detecting interleaved I/O (cooperating processes)
  that it enables BFQ to achieve a high throughput, by queue
  merging, even for queues for which CFQ needs a different
  mechanism, preemption, to get a high throughput. As such, EQM is a
  unified mechanism to achieve a high throughput with interleaved
  I/O.

- Queues are scheduled according to a variant of WF2Q+, named
  B-WF2Q+, and implemented using an augmented rb-tree to preserve an
  O(log N) overall complexity. See [2] for more details. B-WF2Q+ is
  also ready for hierarchical scheduling. However, for a cleaner
  logical breakdown, the code that enables and completes
  hierarchical support is provided in the next commit, which focuses
  exactly on this feature.

- B-WF2Q+ guarantees a tight deviation with respect to an ideal,
  perfectly fair, and smooth service. In particular, B-WF2Q+
  guarantees that each queue receives a fraction of the device
  throughput proportional to its weight, even if the throughput
  fluctuates, and regardless of: the device parameters, the current
  workload and the budgets assigned to the queue.

- The last, budget-independence, property (although probably
  counterintuitive in the first place) is definitely beneficial, for
  the following reasons:

  - First, with any proportional-share scheduler, the maximum
    deviation with respect to an ideal service is proportional to
    the maximum budget (slice) assigned to queues. As a consequence,
    BFQ can keep this deviation tight not only because of the
    accurate service of B-WF2Q+, but also because BFQ *does not*
    need to assign a larger budget to a queue to let the queue
    receive a higher fraction of the device throughput.

  - Second, BFQ is free to choose, for every process (queue), the
    budget that best fits the needs of the process, or best
    leverages the I/O pattern of the process. In particular, BFQ
    updates queue budgets with a simple feedback-loop algorithm that
    allows a high throughput to be achieved, while still providing
    tight latency guarantees to time-sensitive applications. When
    the in-service queue expires, this algorithm computes the next
    budget of the queue so as to:

    - Let large budgets be eventually assigned to the queues
      associated with I/O-bound applications performing sequential
      I/O: in fact, the longer these applications are served once
      they get access to the device, the higher the throughput is.

    - Let small budgets be eventually assigned to the queues
      associated with time-sensitive applications (which typically
      perform sporadic and short I/O), because, the smaller the
      budget assigned to a queue waiting for service is, the sooner
      B-WF2Q+ will serve that queue (Subsec 3.3 in [2]).

    (A toy sketch of this feedback idea is given at the end of this
    section.)

- If several processes are competing for the device at the same time,
  but all processes and groups have the same weight, then BFQ
  guarantees the expected throughput distribution without ever idling
  the device. It uses preemption instead. Throughput is then much
  higher in this common scenario.

- ioprio classes are served in strict priority order, i.e.,
  lower-priority queues are not served as long as there are
  higher-priority queues. Among queues in the same class, the
  bandwidth is distributed in proportion to the weight of each
  queue. A very thin extra bandwidth is however guaranteed to
  the Idle class, to prevent it from starving.
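
As promised above, here is a toy, user-space C sketch of the budget
feedback idea. It is not BFQ's actual code (the in-kernel heuristic is
considerably more elaborate); it only illustrates the direction of the
adjustment: a queue that keeps exhausting its budget looks sequential
and gets a larger budget, while a queue that keeps hitting the budget
timeout looks seeky and gets its budget shrunk.

  /* Toy model of per-queue budget feedback; not BFQ's actual algorithm. */
  enum expire_reason { BUDGET_EXHAUSTED, BUDGET_TIMEOUT, QUEUE_EMPTY };

  struct toy_queue {
	unsigned long budget;	/* sectors the queue may consume per slot */
	unsigned long served;	/* sectors actually served in the last slot */
  };

  #define TOY_MAX_BUDGET	16384UL
  #define TOY_MIN_BUDGET	256UL

  static void toy_update_budget(struct toy_queue *q, enum expire_reason why)
  {
	switch (why) {
	case BUDGET_EXHAUSTED:	/* sequential I/O: serve it longer next time */
		q->budget = q->budget * 2 > TOY_MAX_BUDGET ?
			    TOY_MAX_BUDGET : q->budget * 2;
		break;
	case BUDGET_TIMEOUT:	/* seeky I/O: do not let it hold the device */
		q->budget = q->served > TOY_MIN_BUDGET ?
			    q->served : TOY_MIN_BUDGET;
		break;
	case QUEUE_EMPTY:	/* short, sporadic I/O: keep the budget small */
	default:
		break;
	}
  }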


3. What are BFQ's tunables?
===========================

The tunables back_seek_max, back_seek_penalty, fifo_expire_async and
fifo_expire_sync below are the same as in CFQ. Their description is
just copied from that for CFQ. Some considerations in the description
of slice_idle are copied from CFQ too.

per-process ioprio and weight
-----------------------------

Unless the cgroups interface is used (see "4. BFQ group scheduling"),
weights can be assigned to processes only indirectly, through I/O
priorities, and according to the relation:
weight = (IOPRIO_BE_NR - ioprio) * 10.
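
As a quick illustration (IOPRIO_BE_NR is 8 in current kernels),
best-effort priority 0 maps to weight 80, the default priority 4 to
weight 40, and priority 7 to weight 10. A hypothetical helper mirroring
the relation:

  /* Hypothetical helper for the relation above (IOPRIO_BE_NR == 8). */
  static inline int bfq_ioprio_to_weight_example(int ioprio)
  {
	return (8 /* IOPRIO_BE_NR */ - ioprio) * 10;
  }
  /* ioprio 0 -> 80, ioprio 4 (default) -> 40, ioprio 7 -> 10 */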

Beware that, if low-latency is set, then BFQ automatically raises the
weight of the queues associated with interactive and soft real-time
applications. Unset this tunable if you need/want to control weights.

slice_idle
----------

This parameter specifies how long BFQ should idle for the next I/O
request, when certain sync BFQ queues become empty. By default
slice_idle is a non-zero value. Idling has a double purpose: boosting
throughput and making sure that the desired throughput distribution is
respected (see the description of how BFQ works, and, if needed, the
papers referred to there).

As for throughput, idling can be very helpful on highly seeky media
like single-spindle SATA/SAS disks, where we can cut down on the
overall number of seeks and see improved throughput.

Setting slice_idle to 0 will remove all the idling on queues and one
should see an overall improved throughput on faster storage devices
like multiple SATA/SAS disks in a hardware RAID configuration.

So depending on storage and workload, it might be useful to set
slice_idle=0. In general, for SATA/SAS disks and software RAID of
SATA/SAS disks, keeping slice_idle enabled should be useful. For any
configuration where there are multiple spindles behind a single LUN
(host-based hardware RAID controller or storage arrays), setting
slice_idle=0 may result in better throughput and acceptable
latencies.

Idling is however necessary to have service guarantees enforced in
case of differentiated weights or differentiated I/O-request lengths.
To see why, suppose that a given BFQ queue A must get several I/O
requests served for each request served for another queue B. Idling
ensures that, if A makes a new I/O request slightly after becoming
empty, then no request of B is dispatched in the middle, and thus A
does not lose the possibility to get more than one request dispatched
before the next request of B is dispatched. Note that idling
guarantees the desired differentiated treatment of queues only in
terms of I/O-request dispatches. To guarantee that the actual service
order then corresponds to the dispatch order, the strict_guarantees
tunable must be set too.

There is an important flipside for idling: apart from the above cases
where it is beneficial also for throughput, idling can severely impact
throughput. One important case is random workloads. Because of this
issue, BFQ tends to avoid idling as much as possible, when it is not
beneficial also for throughput. As a consequence of this behavior, and
of further issues described for the strict_guarantees tunable,
short-term service guarantees may be occasionally violated. And, in
some cases, these guarantees may be more important than guaranteeing
maximum throughput. For example, in video playing/streaming, a very
low drop rate may be more important than maximum throughput. In these
cases, consider setting the strict_guarantees parameter.

strict_guarantees
-----------------

If this parameter is set (default: unset), then BFQ

- always performs idling when the in-service queue becomes empty;

- forces the device to serve one I/O request at a time, by dispatching a
  new request only if there is no outstanding request.

In the presence of differentiated weights or I/O-request sizes, both
the above conditions are needed to guarantee that every BFQ queue
receives its allotted share of the bandwidth. The first condition is
needed for the reasons explained in the description of the slice_idle
tunable. The second condition is needed because all modern storage
devices reorder internally-queued requests, which may trivially break
the service guarantees enforced by the I/O scheduler.

Setting strict_guarantees may evidently affect throughput.

back_seek_max
-------------

This specifies, in Kbytes, the maximum "distance" for backward seeking.
The distance is the amount of space from the current head location to
the sectors that are backward in terms of distance.

This parameter allows the scheduler to anticipate requests in the
"backward" direction and consider them as being the "next" if they are
within this distance from the current head location.

back_seek_penalty
-----------------

This parameter is used to compute the cost of backward seeking. If the
backward distance of a request is just 1/back_seek_penalty of the
distance of a "front" request, then the seek costs of the two requests
are considered equivalent.

So the scheduler will not bias toward one or the other request
(otherwise it would bias toward the front request). The default value
of back_seek_penalty is 2.
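
Illustratively, with the default back_seek_penalty of 2, a backward
request wins only when its penalized distance is smaller than the
distance of the best forward candidate. A simplified comparison (this
is not the scheduler's actual selection code) could look like:

  /* Toy request-selection rule for backward seeks (simplified). */
  static int toy_prefer_backward(unsigned long back_dist,
				 unsigned long front_dist,
				 unsigned long back_seek_max,
				 unsigned int back_seek_penalty)
  {
	if (back_dist > back_seek_max)	/* too far behind the head */
		return 0;
	/* equal cost when back_dist == front_dist / back_seek_penalty */
	return back_dist * back_seek_penalty < front_dist;
  }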

fifo_expire_async
-----------------

This parameter is used to set the timeout of asynchronous requests. Its
default value is 248ms.

fifo_expire_sync
----------------

This parameter is used to set the timeout of synchronous requests. Its
default value is 124ms. To favor synchronous requests over asynchronous
ones, this value should be decreased relative to fifo_expire_async.

low_latency
-----------

This parameter is used to enable/disable BFQ's low latency mode. By
default, low latency mode is enabled. If enabled, interactive and soft
real-time applications are privileged and experience a lower latency,
as explained in more detail in the description of how BFQ works.

DISABLE this mode if you need full control over bandwidth
distribution. In fact, if it is enabled, then BFQ automatically
increases the bandwidth share of privileged applications, as the main
means to guarantee them a lower latency.

In addition, as already highlighted at the beginning of this document,
DISABLE this mode if your only goal is to achieve a high throughput.
In fact, privileging the I/O of some application over the rest may
entail a lower throughput. To achieve the highest-possible throughput
on a non-rotational device, setting slice_idle to 0 may be needed too
(at the cost of giving up any strong guarantee on fairness and low
latency).

timeout_sync
------------

Maximum amount of device time that can be given to a task (queue) once
it has been selected for service. On devices with costly seeks,
increasing this time usually increases maximum throughput. On the
opposite end, increasing this time coarsens the granularity of the
short-term bandwidth and latency guarantees, especially if the
following parameter is set to zero.

max_budget
----------

Maximum amount of service, measured in sectors, that can be provided
to a BFQ queue once it is set in service (of course within the limits
of the above timeout). As explained in the description of the
algorithm, larger values increase the throughput in proportion to
the percentage of sequential I/O requests issued. The price of larger
values is that they coarsen the granularity of short-term bandwidth
and latency guarantees.

The default value is 0, which enables auto-tuning: BFQ sets max_budget
to the maximum number of sectors that can be served during
timeout_sync, according to the estimated peak rate.
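
For a rough feel of the auto-tuning, assume an estimated peak rate of
150 MB/s and a timeout_sync of 125 ms (both values are only assumptions
for this example): about 18.75 MB, i.e. roughly 36600 512-byte sectors,
can be served within one timeout. A back-of-the-envelope sketch of the
idea (the in-kernel formula differs in its details):

  /* Back-of-the-envelope sketch of the auto-tuning idea, not kernel code. */
  static unsigned long toy_auto_max_budget(unsigned long peak_rate_bps,
					   unsigned long timeout_sync_ms)
  {
	unsigned long long bytes =
		(unsigned long long)peak_rate_bps * timeout_sync_ms / 1000;

	return bytes >> 9;	/* convert bytes to 512-byte sectors */
  }
  /* e.g. toy_auto_max_budget(150000000, 125) ~= 36621 sectors */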

weights
-------

Read-only parameter, used to show the weights of the currently active
BFQ queues.


wr_ tunables
------------

BFQ exports a few parameters to control/tune the behavior of
low-latency heuristics.

wr_coeff

Factor by which the weight of a weight-raised queue is multiplied. If
the queue is deemed soft real-time, then the weight is further
multiplied by an additional, constant factor.

wr_max_time

Maximum duration of a weight-raising period for an interactive task
(ms). If set to zero (default value), then this value is computed
automatically, as a function of the peak rate of the device. In any
case, when the value of this parameter is read, it always reports the
current duration, regardless of whether it has been set manually or
computed automatically.

wr_max_softrt_rate

Maximum service rate below which a queue is deemed to be associated
with a soft real-time application, and is then weight-raised
accordingly (sectors/sec).

wr_min_idle_time

Minimum idle period after which interactive weight-raising may be
reactivated for a queue (in ms).

wr_rt_max_time

Maximum weight-raising duration for soft real-time queues (in ms). The
start time from which this duration is considered is automatically
moved forward if the queue is detected to be still soft real-time
before the current soft real-time weight-raising period finishes.

wr_min_inter_arr_async

Minimum period between I/O request arrivals after which weight-raising
may be reactivated for an already busy async queue (in ms).


4. Group scheduling with BFQ
============================

BFQ supports both cgroups-v1 and cgroups-v2 io controllers, namely
blkio and io. In particular, BFQ supports weight-based proportional
share. To activate cgroups support, set BFQ_GROUP_IOSCHED.

4-1 Service guarantees provided
-------------------------------

With BFQ, proportional share means true proportional share of the
device bandwidth, according to group weights. For example, a group
with weight 200 gets twice the bandwidth, and not just twice the time,
of a group with weight 100.

BFQ supports hierarchies (group trees) of any depth. Bandwidth is
distributed among groups and processes in the expected way: for each
group, the children of the group share the whole bandwidth of the
group in proportion to their weights. In particular, this implies
that, for each leaf group, every process of the group receives the
same share of the whole group bandwidth, unless the ioprio of the
process is modified.

The resource-sharing guarantee for a group may partially or totally
switch from bandwidth to time, if providing bandwidth guarantees to
the group lowers the throughput too much. This switch occurs on a
per-process basis: if a process of a leaf group causes a throughput
loss when served in such a way as to receive its share of the
bandwidth, then BFQ switches back to just time-based proportional
share for that process.

4-2 Interface
-------------

To get proportional sharing of bandwidth with BFQ for a given device,
BFQ must of course be the active scheduler for that device.

Within each group directory, the names of the files associated with
BFQ-specific cgroup parameters and stats begin with the "bfq."
prefix. So, with cgroups-v1 or cgroups-v2, the full prefix for
BFQ-specific files is "blkio.bfq." or "io.bfq." For example, the group
parameter to set the weight of a group with BFQ is blkio.bfq.weight
or io.bfq.weight.

Parameters to set
-----------------

For each group, there is only the following parameter to set.

weight (namely blkio.bfq.weight or io.bfq.weight): the weight of the
group inside its parent. Available values: 1..10000 (default 100). The
linear mapping between ioprio and weights, described at the beginning
of the tunables section, is still valid, but all weights higher than
IOPRIO_BE_NR*10 are mapped to ioprio 0.
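
For instance, assuming a cgroups-v1 blkio hierarchy mounted at
/sys/fs/cgroup/blkio and an already-created group named "example"
(both the mount point and the group name are only assumptions for this
sketch), the weight could be set from C roughly as follows:

  /* Illustrative only: give the hypothetical group "example" weight 500. */
  #include <stdio.h>

  int main(void)
  {
	FILE *f = fopen("/sys/fs/cgroup/blkio/example/blkio.bfq.weight", "w");

	if (!f) {
		perror("blkio.bfq.weight");
		return 1;
	}
	fprintf(f, "%d\n", 500);	/* valid range: 1..10000, default 100 */
	return fclose(f) ? 1 : 0;
  }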

Recall that, if low-latency is set, then BFQ automatically raises the
weight of the queues associated with interactive and soft real-time
applications. Unset this tunable if you need/want to control weights.


[1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
    Scheduler", Proceedings of the First Workshop on Mobile System
    Technologies (MST-2015), May 2015.
    http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf

[2] P. Valente and M. Andreolini, "Improving Application
    Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
    the 5th Annual International Systems and Storage Conference
    (SYSTOR '12), June 2012.
    Slightly extended version:
    http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-results.pdf
block/Kconfig.iosched
@@ -54,20 +54,20 @@ config IOSCHED_BFQ
	tristate "BFQ I/O scheduler"
	default n
	---help---
	  The BFQ I/O scheduler tries to distribute bandwidth among
	  all processes according to their weights.
	  It aims at distributing the bandwidth as desired, independently of
	  the disk parameters and with any workload. It also tries to
	  guarantee low latency to interactive and soft real-time
	  applications. If compiled built-in (saying Y here), BFQ can
	  be configured to support hierarchical scheduling.
	  The BFQ I/O scheduler distributes bandwidth among all
	  processes according to their weights, regardless of the
	  device parameters and with any workload. It also guarantees
	  a low latency to interactive and soft real-time applications.
	  Details in Documentation/block/bfq-iosched.txt

config BFQ_GROUP_IOSCHED
	bool "BFQ hierarchical scheduling support"
	depends on CGROUPS && IOSCHED_BFQ=y
	depends on IOSCHED_BFQ && BLK_CGROUP
	default n
	---help---
	  Enable hierarchical scheduling in BFQ, using the blkio controller.

	  Enable hierarchical scheduling in BFQ, using the blkio
	  (cgroups-v1) or io (cgroups-v2) controller.

choice
	prompt "Default I/O scheduler"
block/bfq-cgroup.c
@@ -7,7 +7,9 @@
 * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
 *		      Paolo Valente <paolo.valente@unimore.it>
 *
 * Copyright (C) 2010 Paolo Valente <paolo.valente@unimore.it>
 * Copyright (C) 2015 Paolo Valente <paolo.valente@unimore.it>
 *
 * Copyright (C) 2016 Paolo Valente <paolo.valente@linaro.org>
 *
 * Licensed under the GPL-2 as detailed in the accompanying COPYING.BFQ
 * file.

@@ -162,7 +164,7 @@ static struct blkcg_gq *bfqg_to_blkg(struct bfq_group *bfqg)
static struct bfq_group *blkg_to_bfqg(struct blkcg_gq *blkg)
{
	struct blkg_policy_data *pd = blkg_to_pd(blkg, &blkcg_policy_bfq);
	BUG_ON(!pd);

	return pd_to_bfqg(pd);
}
@@ -224,14 +226,6 @@ static void bfqg_stats_update_io_merged(struct bfq_group *bfqg, int rw)
	blkg_rwstat_add(&bfqg->stats.merged, rw, 1);
}

static void bfqg_stats_update_dispatch(struct bfq_group *bfqg,
				       uint64_t bytes, int rw)
{
	blkg_stat_add(&bfqg->stats.sectors, bytes >> 9);
	blkg_rwstat_add(&bfqg->stats.serviced, rw, 1);
	blkg_rwstat_add(&bfqg->stats.service_bytes, rw, bytes);
}

static void bfqg_stats_update_completion(struct bfq_group *bfqg,
			uint64_t start_time, uint64_t io_start_time, int rw)
{

@@ -248,17 +242,11 @@ static void bfqg_stats_update_completion(struct bfq_group *bfqg,
/* @stats = 0 */
static void bfqg_stats_reset(struct bfqg_stats *stats)
{
	if (!stats)
		return;

	/* queued stats shouldn't be cleared */
	blkg_rwstat_reset(&stats->service_bytes);
	blkg_rwstat_reset(&stats->serviced);
	blkg_rwstat_reset(&stats->merged);
	blkg_rwstat_reset(&stats->service_time);
	blkg_rwstat_reset(&stats->wait_time);
	blkg_stat_reset(&stats->time);
	blkg_stat_reset(&stats->unaccounted_time);
	blkg_stat_reset(&stats->avg_queue_size_sum);
	blkg_stat_reset(&stats->avg_queue_size_samples);
	blkg_stat_reset(&stats->dequeue);
@@ -268,19 +256,16 @@ static void bfqg_stats_reset(struct bfqg_stats *stats)
}

/* @to += @from */
static void bfqg_stats_merge(struct bfqg_stats *to, struct bfqg_stats *from)
static void bfqg_stats_add_aux(struct bfqg_stats *to, struct bfqg_stats *from)
{
	if (!to || !from)
		return;

	/* queued stats shouldn't be cleared */
	blkg_rwstat_add_aux(&to->service_bytes, &from->service_bytes);
	blkg_rwstat_add_aux(&to->serviced, &from->serviced);
	blkg_rwstat_add_aux(&to->merged, &from->merged);
	blkg_rwstat_add_aux(&to->service_time, &from->service_time);
	blkg_rwstat_add_aux(&to->wait_time, &from->wait_time);
	blkg_stat_add_aux(&from->time, &from->time);
	blkg_stat_add_aux(&to->unaccounted_time, &from->unaccounted_time);
	blkg_stat_add_aux(&to->avg_queue_size_sum, &from->avg_queue_size_sum);
	blkg_stat_add_aux(&to->avg_queue_size_samples, &from->avg_queue_size_samples);
	blkg_stat_add_aux(&to->dequeue, &from->dequeue);

@@ -308,10 +293,8 @@ static void bfqg_stats_xfer_dead(struct bfq_group *bfqg)
	if (unlikely(!parent))
		return;

	bfqg_stats_merge(&parent->dead_stats, &bfqg->stats);
	bfqg_stats_merge(&parent->dead_stats, &bfqg->dead_stats);
	bfqg_stats_add_aux(&parent->stats, &bfqg->stats);
	bfqg_stats_reset(&bfqg->stats);
	bfqg_stats_reset(&bfqg->dead_stats);
}

static void bfq_init_entity(struct bfq_entity *entity,
@@ -326,21 +309,17 @@ static void bfq_init_entity(struct bfq_entity *entity,
		bfqq->ioprio_class = bfqq->new_ioprio_class;
		bfqg_get(bfqg);
	}
	entity->parent = bfqg->my_entity;
	entity->parent = bfqg->my_entity; /* NULL for root group */
	entity->sched_data = &bfqg->sched_data;
}

static void bfqg_stats_exit(struct bfqg_stats *stats)
{
	blkg_rwstat_exit(&stats->service_bytes);
	blkg_rwstat_exit(&stats->serviced);
	blkg_rwstat_exit(&stats->merged);
	blkg_rwstat_exit(&stats->service_time);
	blkg_rwstat_exit(&stats->wait_time);
	blkg_rwstat_exit(&stats->queued);
	blkg_stat_exit(&stats->sectors);
	blkg_stat_exit(&stats->time);
	blkg_stat_exit(&stats->unaccounted_time);
	blkg_stat_exit(&stats->avg_queue_size_sum);
	blkg_stat_exit(&stats->avg_queue_size_samples);
	blkg_stat_exit(&stats->dequeue);

@@ -351,15 +330,11 @@ static void bfqg_stats_exit(struct bfqg_stats *stats)

static int bfqg_stats_init(struct bfqg_stats *stats, gfp_t gfp)
{
	if (blkg_rwstat_init(&stats->service_bytes, gfp) ||
	    blkg_rwstat_init(&stats->serviced, gfp) ||
	    blkg_rwstat_init(&stats->merged, gfp) ||
	if (blkg_rwstat_init(&stats->merged, gfp) ||
	    blkg_rwstat_init(&stats->service_time, gfp) ||
	    blkg_rwstat_init(&stats->wait_time, gfp) ||
	    blkg_rwstat_init(&stats->queued, gfp) ||
	    blkg_stat_init(&stats->sectors, gfp) ||
	    blkg_stat_init(&stats->time, gfp) ||
	    blkg_stat_init(&stats->unaccounted_time, gfp) ||
	    blkg_stat_init(&stats->avg_queue_size_sum, gfp) ||
	    blkg_stat_init(&stats->avg_queue_size_samples, gfp) ||
	    blkg_stat_init(&stats->dequeue, gfp) ||
@@ -383,11 +358,27 @@ static struct bfq_group_data *blkcg_to_bfqgd(struct blkcg *blkcg)
	return cpd_to_bfqgd(blkcg_to_cpd(blkcg, &blkcg_policy_bfq));
}

static struct blkcg_policy_data *bfq_cpd_alloc(gfp_t gfp)
{
	struct bfq_group_data *bgd;

	bgd = kzalloc(sizeof(*bgd), GFP_KERNEL);
	if (!bgd)
		return NULL;
	return &bgd->pd;
}

static void bfq_cpd_init(struct blkcg_policy_data *cpd)
{
	struct bfq_group_data *d = cpd_to_bfqgd(cpd);

	d->weight = BFQ_DEFAULT_GRP_WEIGHT;
	d->weight = cgroup_subsys_on_dfl(io_cgrp_subsys) ?
		CGROUP_WEIGHT_DFL : BFQ_WEIGHT_LEGACY_DFL;
}

static void bfq_cpd_free(struct blkcg_policy_data *cpd)
{
	kfree(cpd_to_bfqgd(cpd));
}

static struct blkg_policy_data *bfq_pd_alloc(gfp_t gfp, int node)

@@ -398,8 +389,7 @@ static struct blkg_policy_data *bfq_pd_alloc(gfp_t gfp, int node)
	if (!bfqg)
		return NULL;

	if (bfqg_stats_init(&bfqg->stats, gfp) ||
	    bfqg_stats_init(&bfqg->dead_stats, gfp)) {
	if (bfqg_stats_init(&bfqg->stats, gfp)) {
		kfree(bfqg);
		return NULL;
	}
@@ -407,27 +397,20 @@ static struct blkg_policy_data *bfq_pd_alloc(gfp_t gfp, int node)
	return &bfqg->pd;
}

static void bfq_group_set_parent(struct bfq_group *bfqg,
				 struct bfq_group *parent)
{
	struct bfq_entity *entity;

	BUG_ON(!parent);
	BUG_ON(!bfqg);
	BUG_ON(bfqg == parent);

	entity = &bfqg->entity;
	entity->parent = parent->my_entity;
	entity->sched_data = &parent->sched_data;
}

static void bfq_pd_init(struct blkg_policy_data *pd)
{
	struct blkcg_gq *blkg = pd_to_blkg(pd);
	struct bfq_group *bfqg = blkg_to_bfqg(blkg);
	struct bfq_data *bfqd = blkg->q->elevator->elevator_data;
	struct bfq_entity *entity = &bfqg->entity;
	struct bfq_group_data *d = blkcg_to_bfqgd(blkg->blkcg);
	struct blkcg_gq *blkg;
	struct bfq_group *bfqg;
	struct bfq_data *bfqd;
	struct bfq_entity *entity;
	struct bfq_group_data *d;

	blkg = pd_to_blkg(pd);
	BUG_ON(!blkg);
	bfqg = blkg_to_bfqg(blkg);
	bfqd = blkg->q->elevator->elevator_data;
	entity = &bfqg->entity;
	d = blkcg_to_bfqgd(blkg->blkcg);

	entity->orig_weight = entity->weight = entity->new_weight = d->weight;
	entity->my_sched_data = &bfqg->sched_data;
@@ -445,70 +428,53 @@ static void bfq_pd_free(struct blkg_policy_data *pd)
	struct bfq_group *bfqg = pd_to_bfqg(pd);

	bfqg_stats_exit(&bfqg->stats);
	bfqg_stats_exit(&bfqg->dead_stats);

	return kfree(bfqg);
}

/* offset delta from bfqg->stats to bfqg->dead_stats */
static const int dead_stats_off_delta = offsetof(struct bfq_group, dead_stats) -
					offsetof(struct bfq_group, stats);

/* to be used by recursive prfill, sums live and dead stats recursively */
static u64 bfqg_stat_pd_recursive_sum(struct blkg_policy_data *pd, int off)
{
	u64 sum = 0;

	sum += blkg_stat_recursive_sum(pd_to_blkg(pd), &blkcg_policy_bfq, off);
	sum += blkg_stat_recursive_sum(pd_to_blkg(pd), &blkcg_policy_bfq,
				       off + dead_stats_off_delta);
	return sum;
}

/* to be used by recursive prfill, sums live and dead rwstats recursively */
static struct blkg_rwstat bfqg_rwstat_pd_recursive_sum(struct blkg_policy_data *pd,
						       int off)
{
	struct blkg_rwstat a, b;

	a = blkg_rwstat_recursive_sum(pd_to_blkg(pd), &blkcg_policy_bfq, off);
	b = blkg_rwstat_recursive_sum(pd_to_blkg(pd), &blkcg_policy_bfq,
				      off + dead_stats_off_delta);
	blkg_rwstat_add_aux(&a, &b);
	return a;
}

static void bfq_pd_reset_stats(struct blkg_policy_data *pd)
{
	struct bfq_group *bfqg = pd_to_bfqg(pd);

	bfqg_stats_reset(&bfqg->stats);
	bfqg_stats_reset(&bfqg->dead_stats);
}

static struct bfq_group *bfq_find_alloc_group(struct bfq_data *bfqd,
					      struct blkcg *blkcg)
static void bfq_group_set_parent(struct bfq_group *bfqg,
				 struct bfq_group *parent)
{
	struct request_queue *q = bfqd->queue;
	struct bfq_group *bfqg = NULL, *parent;
	struct bfq_entity *entity = NULL;
	struct bfq_entity *entity;
	BUG_ON(!parent);
	BUG_ON(!bfqg);
	BUG_ON(bfqg == parent);

	entity = &bfqg->entity;
	entity->parent = parent->my_entity;
	entity->sched_data = &parent->sched_data;
}

static struct bfq_group *bfq_lookup_bfqg(struct bfq_data *bfqd,
					 struct blkcg *blkcg)
{
	struct blkcg_gq *blkg;

	blkg = blkg_lookup(blkcg, bfqd->queue);
	if (likely(blkg))
		return blkg_to_bfqg(blkg);
	return NULL;
}

static struct bfq_group *bfq_find_set_group(struct bfq_data *bfqd,
					    struct blkcg *blkcg)
{
	struct bfq_group *bfqg, *parent;
	struct bfq_entity *entity;

	assert_spin_locked(bfqd->queue->queue_lock);

	/* avoid lookup for the common case where there's no blkcg */
	if (blkcg == &blkcg_root) {
		bfqg = bfqd->root_group;
	} else {
		struct blkcg_gq *blkg;
	bfqg = bfq_lookup_bfqg(bfqd, blkcg);

		blkg = blkg_lookup_create(blkcg, q);
		if (!IS_ERR(blkg))
			bfqg = blkg_to_bfqg(blkg);
		else /* fallback to root_group */
			bfqg = bfqd->root_group;
	}

	BUG_ON(!bfqg);
	if (unlikely(!bfqg))
		return NULL;

	/*
	 * Update chain of bfq_groups as we might be handling a leaf group
@@ -531,13 +497,18 @@ static struct bfq_group *bfq_find_alloc_group(struct bfq_data *bfqd,
	return bfqg;
}

static void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq);
static void bfq_pos_tree_add_move(struct bfq_data *bfqd,
				  struct bfq_queue *bfqq);

static void bfq_bfqq_expire(struct bfq_data *bfqd,
			    struct bfq_queue *bfqq,
			    bool compensate,
			    enum bfqq_expiration reason);

/**
 * bfq_bfqq_move - migrate @bfqq to @bfqg.
 * @bfqd: queue descriptor.
 * @bfqq: the queue to move.
 * @entity: @bfqq's entity.
 * @bfqg: the group to move to.
 *
 * Move @bfqq to @bfqg, deactivating it from its old group and reactivating
@@ -548,26 +519,40 @@ static void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 * rcu_read_lock()).
 */
static void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq,
			  struct bfq_entity *entity, struct bfq_group *bfqg)
			  struct bfq_group *bfqg)
{
	int busy, resume;
	struct bfq_entity *entity = &bfqq->entity;

	busy = bfq_bfqq_busy(bfqq);
	resume = !RB_EMPTY_ROOT(&bfqq->sort_list);

	BUG_ON(resume && !entity->on_st);
	BUG_ON(busy && !resume && entity->on_st &&
	BUG_ON(!bfq_bfqq_busy(bfqq) && !RB_EMPTY_ROOT(&bfqq->sort_list));
	BUG_ON(!RB_EMPTY_ROOT(&bfqq->sort_list) && !entity->on_st);
	BUG_ON(bfq_bfqq_busy(bfqq) && RB_EMPTY_ROOT(&bfqq->sort_list)
	       && entity->on_st &&
	       bfqq != bfqd->in_service_queue);
	BUG_ON(!bfq_bfqq_busy(bfqq) && bfqq == bfqd->in_service_queue);

	if (busy) {
		BUG_ON(atomic_read(&bfqq->ref) < 2);
		/* If bfqq is empty, then bfq_bfqq_expire also invokes
		 * bfq_del_bfqq_busy, thereby removing bfqq and its entity
		 * from data structures related to current group. Otherwise we
		 * need to remove bfqq explicitly with bfq_deactivate_bfqq, as
		 * we do below.
		 */
		if (bfqq == bfqd->in_service_queue)
			bfq_bfqq_expire(bfqd, bfqd->in_service_queue,
					false, BFQ_BFQQ_PREEMPTED);

		if (!resume)
			bfq_del_bfqq_busy(bfqd, bfqq, 0);
		else
			bfq_deactivate_bfqq(bfqd, bfqq, 0);
	} else if (entity->on_st)
	BUG_ON(entity->on_st && !bfq_bfqq_busy(bfqq)
	       && &bfq_entity_service_tree(entity)->idle !=
	       entity->tree);

	BUG_ON(RB_EMPTY_ROOT(&bfqq->sort_list) && bfq_bfqq_busy(bfqq));

	if (bfq_bfqq_busy(bfqq))
		bfq_deactivate_bfqq(bfqd, bfqq, false, false);
	else if (entity->on_st) {
		BUG_ON(&bfq_entity_service_tree(entity)->idle !=
		       entity->tree);
		bfq_put_idle_entity(bfq_entity_service_tree(entity), entity);
	}
	bfqg_put(bfqq_group(bfqq));

	/*
@@ -579,14 +564,17 @@ static void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq,
	entity->sched_data = &bfqg->sched_data;
	bfqg_get(bfqg);

	if (busy) {
		BUG_ON(RB_EMPTY_ROOT(&bfqq->sort_list) && bfq_bfqq_busy(bfqq));
	if (bfq_bfqq_busy(bfqq)) {
		bfq_pos_tree_add_move(bfqd, bfqq);
		if (resume)
			bfq_activate_bfqq(bfqd, bfqq);
		bfq_activate_bfqq(bfqd, bfqq);
	}

	if (!bfqd->in_service_queue && !bfqd->rq_in_driver)
		bfq_schedule_dispatch(bfqd);
	BUG_ON(entity->on_st && !bfq_bfqq_busy(bfqq)
	       && &bfq_entity_service_tree(entity)->idle !=
	       entity->tree);
}

/**
@@ -613,7 +601,11 @@ static struct bfq_group *__bfq_bic_change_cgroup(struct bfq_data *bfqd,

	lockdep_assert_held(bfqd->queue->queue_lock);

	bfqg = bfq_find_alloc_group(bfqd, blkcg);
	bfqg = bfq_find_set_group(bfqd, blkcg);

	if (unlikely(!bfqg))
		bfqg = bfqd->root_group;

	if (async_bfqq) {
		entity = &async_bfqq->entity;

@@ -621,7 +613,8 @@ static struct bfq_group *__bfq_bic_change_cgroup(struct bfq_data *bfqd,
		bic_set_bfqq(bic, NULL, 0);
		bfq_log_bfqq(bfqd, async_bfqq,
			     "bic_change_group: %p %d",
			     async_bfqq, atomic_read(&async_bfqq->ref));
			     async_bfqq,
			     async_bfqq->ref);
		bfq_put_queue(async_bfqq);
	}
}
@@ -629,7 +622,7 @@ static struct bfq_group *__bfq_bic_change_cgroup(struct bfq_data *bfqd,
	if (sync_bfqq) {
		entity = &sync_bfqq->entity;
		if (entity->sched_data != &bfqg->sched_data)
			bfq_bfqq_move(bfqd, sync_bfqq, entity, bfqg);
			bfq_bfqq_move(bfqd, sync_bfqq, bfqg);
	}

	return bfqg;

@@ -638,25 +631,23 @@ static struct bfq_group *__bfq_bic_change_cgroup(struct bfq_data *bfqd,
static void bfq_bic_update_cgroup(struct bfq_io_cq *bic, struct bio *bio)
{
	struct bfq_data *bfqd = bic_to_bfqd(bic);
	struct blkcg *blkcg;
	struct bfq_group *bfqg = NULL;
	uint64_t id;
	uint64_t serial_nr;

	rcu_read_lock();
	blkcg = bio_blkcg(bio);
	id = blkcg->css.serial_nr;
	rcu_read_unlock();
	serial_nr = bio_blkcg(bio)->css.serial_nr;

	/*
	 * Check whether blkcg has changed. The condition may trigger
	 * spuriously on a newly created cic but there's no harm.
	 */
	if (unlikely(!bfqd) || likely(bic->blkcg_id == id))
		return;
	if (unlikely(!bfqd) || likely(bic->blkcg_serial_nr == serial_nr))
		goto out;

	bfqg = __bfq_bic_change_cgroup(bfqd, bic, blkcg);
	BUG_ON(!bfqg);
	bic->blkcg_id = id;
	bfqg = __bfq_bic_change_cgroup(bfqd, bic, bio_blkcg(bio));
	bic->blkcg_serial_nr = serial_nr;
out:
	rcu_read_unlock();
}

/**
@@ -668,7 +659,7 @@ static void bfq_flush_idle_tree(struct bfq_service_tree *st)
	struct bfq_entity *entity = st->first_idle;

	for (; entity ; entity = st->first_idle)
		__bfq_deactivate_entity(entity, 0);
		__bfq_deactivate_entity(entity, false);
}

/**

@@ -682,7 +673,7 @@ static void bfq_reparent_leaf_entity(struct bfq_data *bfqd,
	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);

	BUG_ON(!bfqq);
	bfq_bfqq_move(bfqd, bfqq, entity, bfqd->root_group);
	bfq_bfqq_move(bfqd, bfqq, bfqd->root_group);
	return;
}


@@ -716,11 +707,12 @@ static void bfq_reparent_active_entities(struct bfq_data *bfqd,
}

/**
 * bfq_destroy_group - destroy @bfqg.
 * @bfqg: the group being destroyed.
 * bfq_pd_offline - deactivate the entity associated with @pd,
 *		    and reparent its children entities.
 * @pd: descriptor of the policy going offline.
 *
 * Destroy @bfqg, making sure that it is not referenced from its parent.
 * blkio already grabs the queue_lock for us, so no need to use RCU-based magic
 * blkio already grabs the queue_lock for us, so no need to use
 * RCU-based magic
 */
static void bfq_pd_offline(struct blkg_policy_data *pd)
{
@@ -775,10 +767,15 @@ static void bfq_pd_offline(struct blkg_policy_data *pd)
	BUG_ON(bfqg->sched_data.next_in_service);
	BUG_ON(bfqg->sched_data.in_service_entity);

	__bfq_deactivate_entity(entity, 0);
	__bfq_deactivate_entity(entity, false);
	bfq_put_async_queues(bfqd, bfqg);
	BUG_ON(entity->tree);

	/*
	 * @blkg is going offline and will be ignored by
	 * blkg_[rw]stat_recursive_sum(). Transfer stats to the parent so
	 * that they don't get lost. If IOs complete after this point, the
	 * stats for them will be lost. Oh well...
	 */
	bfqg_stats_xfer_dead(bfqg);
}

@@ -788,46 +785,35 @@ static void bfq_end_wr_async(struct bfq_data *bfqd)

	list_for_each_entry(blkg, &bfqd->queue->blkg_list, q_node) {
		struct bfq_group *bfqg = blkg_to_bfqg(blkg);
		BUG_ON(!bfqg);

		bfq_end_wr_async_queues(bfqd, bfqg);
	}
	bfq_end_wr_async_queues(bfqd, bfqd->root_group);
}

static u64 bfqio_cgroup_weight_read(struct cgroup_subsys_state *css,
				    struct cftype *cftype)
{
	struct blkcg *blkcg = css_to_blkcg(css);
	struct bfq_group_data *bfqgd = blkcg_to_bfqgd(blkcg);
	int ret = -EINVAL;

	spin_lock_irq(&blkcg->lock);
	ret = bfqgd->weight;
	spin_unlock_irq(&blkcg->lock);

	return ret;
}

static int bfqio_cgroup_weight_read_dfl(struct seq_file *sf, void *v)
static int bfq_io_show_weight(struct seq_file *sf, void *v)
{
	struct blkcg *blkcg = css_to_blkcg(seq_css(sf));
	struct bfq_group_data *bfqgd = blkcg_to_bfqgd(blkcg);
	unsigned int val = 0;

	spin_lock_irq(&blkcg->lock);
	seq_printf(sf, "%u\n", bfqgd->weight);
	spin_unlock_irq(&blkcg->lock);
	if (bfqgd)
		val = bfqgd->weight;

	seq_printf(sf, "%u\n", val);

	return 0;
}

static int bfqio_cgroup_weight_write(struct cgroup_subsys_state *css,
				     struct cftype *cftype,
				     u64 val)
static int bfq_io_set_weight_legacy(struct cgroup_subsys_state *css,
				    struct cftype *cftype,
				    u64 val)
{
	struct blkcg *blkcg = css_to_blkcg(css);
	struct bfq_group_data *bfqgd = blkcg_to_bfqgd(blkcg);
	struct blkcg_gq *blkg;
	int ret = -EINVAL;
	int ret = -ERANGE;

	if (val < BFQ_MIN_WEIGHT || val > BFQ_MAX_WEIGHT)
		return ret;
@@ -871,13 +857,18 @@ static int bfqio_cgroup_weight_write(struct cgroup_subsys_state *css,
	return ret;
}

static ssize_t bfqio_cgroup_weight_write_dfl(struct kernfs_open_file *of,
					     char *buf, size_t nbytes,
					     loff_t off)
static ssize_t bfq_io_set_weight(struct kernfs_open_file *of,
				 char *buf, size_t nbytes,
				 loff_t off)
{
	u64 weight;
	/* First unsigned long found in the file is used */
	return bfqio_cgroup_weight_write(of_css(of), NULL,
					 simple_strtoull(strim(buf), NULL, 0));
	int ret = kstrtoull(strim(buf), 0, &weight);

	if (ret)
		return ret;

	return bfq_io_set_weight_legacy(of_css(of), NULL, weight);
}

static int bfqg_print_stat(struct seq_file *sf, void *v)

@@ -897,16 +888,17 @@ static int bfqg_print_rwstat(struct seq_file *sf, void *v)
static u64 bfqg_prfill_stat_recursive(struct seq_file *sf,
				      struct blkg_policy_data *pd, int off)
{
	u64 sum = bfqg_stat_pd_recursive_sum(pd, off);

	u64 sum = blkg_stat_recursive_sum(pd_to_blkg(pd),
					  &blkcg_policy_bfq, off);
	return __blkg_prfill_u64(sf, pd, sum);
}

static u64 bfqg_prfill_rwstat_recursive(struct seq_file *sf,
					struct blkg_policy_data *pd, int off)
{
	struct blkg_rwstat sum = bfqg_rwstat_pd_recursive_sum(pd, off);

	struct blkg_rwstat sum = blkg_rwstat_recursive_sum(pd_to_blkg(pd),
							   &blkcg_policy_bfq,
							   off);
	return __blkg_prfill_rwstat(sf, pd, &sum);
}
@@ -926,6 +918,41 @@ static int bfqg_print_rwstat_recursive(struct seq_file *sf, void *v)
	return 0;
}

static u64 bfqg_prfill_sectors(struct seq_file *sf, struct blkg_policy_data *pd,
			       int off)
{
	u64 sum = blkg_rwstat_total(&pd->blkg->stat_bytes);

	return __blkg_prfill_u64(sf, pd, sum >> 9);
}

static int bfqg_print_stat_sectors(struct seq_file *sf, void *v)
{
	blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)),
			  bfqg_prfill_sectors, &blkcg_policy_bfq, 0, false);
	return 0;
}

static u64 bfqg_prfill_sectors_recursive(struct seq_file *sf,
					 struct blkg_policy_data *pd, int off)
{
	struct blkg_rwstat tmp = blkg_rwstat_recursive_sum(pd->blkg, NULL,
					offsetof(struct blkcg_gq, stat_bytes));
	u64 sum = atomic64_read(&tmp.aux_cnt[BLKG_RWSTAT_READ]) +
		atomic64_read(&tmp.aux_cnt[BLKG_RWSTAT_WRITE]);

	return __blkg_prfill_u64(sf, pd, sum >> 9);
}

static int bfqg_print_stat_sectors_recursive(struct seq_file *sf, void *v)
{
	blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)),
			  bfqg_prfill_sectors_recursive, &blkcg_policy_bfq, 0,
			  false);
	return 0;
}


static u64 bfqg_prfill_avg_queue_size(struct seq_file *sf,
				      struct blkg_policy_data *pd, int off)
{
@@ -961,38 +988,14 @@ static struct bfq_group *bfq_create_group_hierarchy(struct bfq_data *bfqd, int n
	return blkg_to_bfqg(bfqd->queue->root_blkg);
}

static struct blkcg_policy_data *bfq_cpd_alloc(gfp_t gfp)
{
	struct bfq_group_data *bgd;

	bgd = kzalloc(sizeof(*bgd), GFP_KERNEL);
	if (!bgd)
		return NULL;
	return &bgd->pd;
}

static void bfq_cpd_free(struct blkcg_policy_data *cpd)
{
	kfree(cpd_to_bfqgd(cpd));
}

static struct cftype bfqio_files_dfl[] = {
	{
		.name = "weight",
		.flags = CFTYPE_NOT_ON_ROOT,
		.seq_show = bfqio_cgroup_weight_read_dfl,
		.write = bfqio_cgroup_weight_write_dfl,
	},
	{} /* terminate */
};

static struct cftype bfqio_files[] = {
static struct cftype bfq_blkcg_legacy_files[] = {
	{
		.name = "bfq.weight",
		.read_u64 = bfqio_cgroup_weight_read,
		.write_u64 = bfqio_cgroup_weight_write,
		.flags = CFTYPE_NOT_ON_ROOT,
		.seq_show = bfq_io_show_weight,
		.write_u64 = bfq_io_set_weight_legacy,
	},
	/* statistics, cover only the tasks in the bfqg */
	/* statistics, covers only the tasks in the bfqg */
	{
		.name = "bfq.time",
		.private = offsetof(struct bfq_group, stats.time),
@@ -1000,18 +1003,17 @@ static struct cftype bfqio_files[] = {
	},
	{
		.name = "bfq.sectors",
		.private = offsetof(struct bfq_group, stats.sectors),
		.seq_show = bfqg_print_stat,
		.seq_show = bfqg_print_stat_sectors,
	},
	{
		.name = "bfq.io_service_bytes",
		.private = offsetof(struct bfq_group, stats.service_bytes),
		.seq_show = bfqg_print_rwstat,
		.private = (unsigned long)&blkcg_policy_bfq,
		.seq_show = blkg_print_stat_bytes,
	},
	{
		.name = "bfq.io_serviced",
		.private = offsetof(struct bfq_group, stats.serviced),
		.seq_show = bfqg_print_rwstat,
		.private = (unsigned long)&blkcg_policy_bfq,
		.seq_show = blkg_print_stat_ios,
	},
	{
		.name = "bfq.io_service_time",

@@ -1042,18 +1044,17 @@ static struct cftype bfqio_files[] = {
	},
	{
		.name = "bfq.sectors_recursive",
		.private = offsetof(struct bfq_group, stats.sectors),
		.seq_show = bfqg_print_stat_recursive,
		.seq_show = bfqg_print_stat_sectors_recursive,
	},
	{
		.name = "bfq.io_service_bytes_recursive",
		.private = offsetof(struct bfq_group, stats.service_bytes),
		.seq_show = bfqg_print_rwstat_recursive,
		.private = (unsigned long)&blkcg_policy_bfq,
		.seq_show = blkg_print_stat_bytes_recursive,
	},
	{
		.name = "bfq.io_serviced_recursive",
		.private = offsetof(struct bfq_group, stats.serviced),
		.seq_show = bfqg_print_rwstat_recursive,
		.private = (unsigned long)&blkcg_policy_bfq,
		.seq_show = blkg_print_stat_ios_recursive,
	},
	{
		.name = "bfq.io_service_time_recursive",
@@ -1099,32 +1100,41 @@ static struct cftype bfqio_files[] = {
		.private = offsetof(struct bfq_group, stats.dequeue),
		.seq_show = bfqg_print_stat,
	},
	{
		.name = "bfq.unaccounted_time",
		.private = offsetof(struct bfq_group, stats.unaccounted_time),
		.seq_show = bfqg_print_stat,
	},
	{ } /* terminate */
};

static struct blkcg_policy blkcg_policy_bfq = {
	.dfl_cftypes = bfqio_files_dfl,
	.legacy_cftypes = bfqio_files,

	.pd_alloc_fn = bfq_pd_alloc,
	.pd_init_fn = bfq_pd_init,
	.pd_offline_fn = bfq_pd_offline,
	.pd_free_fn = bfq_pd_free,
	.pd_reset_stats_fn = bfq_pd_reset_stats,

	.cpd_alloc_fn = bfq_cpd_alloc,
	.cpd_init_fn = bfq_cpd_init,
	.cpd_bind_fn = bfq_cpd_init,
	.cpd_free_fn = bfq_cpd_free,

static struct cftype bfq_blkg_files[] = {
	{
		.name = "bfq.weight",
		.flags = CFTYPE_NOT_ON_ROOT,
		.seq_show = bfq_io_show_weight,
		.write = bfq_io_set_weight,
	},
	{} /* terminate */
};

#else
#else /* CONFIG_BFQ_GROUP_IOSCHED */

static inline void bfqg_stats_update_io_add(struct bfq_group *bfqg,
			struct bfq_queue *bfqq, int rw) { }
static inline void bfqg_stats_update_io_remove(struct bfq_group *bfqg,
			int rw) { }
static inline void bfqg_stats_update_io_merged(struct bfq_group *bfqg,
			int rw) { }
static inline void bfqg_stats_update_completion(struct bfq_group *bfqg,
			uint64_t start_time, uint64_t io_start_time, int rw) { }
static inline void
bfqg_stats_set_start_group_wait_time(struct bfq_group *bfqg,
				     struct bfq_group *curr_bfqg) { }
static inline void bfqg_stats_end_empty_time(struct bfqg_stats *stats) { }
static inline void bfqg_stats_update_dequeue(struct bfq_group *bfqg) { }
static inline void bfqg_stats_set_start_empty_time(struct bfq_group *bfqg) { }
static inline void bfqg_stats_update_idle_time(struct bfq_group *bfqg) { }
static inline void bfqg_stats_set_start_idle_time(struct bfq_group *bfqg) { }
static inline void bfqg_stats_update_avg_queue_size(struct bfq_group *bfqg) { }

static void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq,
			  struct bfq_group *bfqg) {}

static void bfq_init_entity(struct bfq_entity *entity,
			    struct bfq_group *bfqg)
@@ -1139,37 +1149,26 @@ static void bfq_init_entity(struct bfq_entity *entity,
	entity->sched_data = &bfqg->sched_data;
}

static struct bfq_group *
bfq_bic_update_cgroup(struct bfq_io_cq *bic, struct bio *bio)
{
	struct bfq_data *bfqd = bic_to_bfqd(bic);
	return bfqd->root_group;
}

static void bfq_bfqq_move(struct bfq_data *bfqd,
			  struct bfq_queue *bfqq,
			  struct bfq_entity *entity,
			  struct bfq_group *bfqg)
{
}
static void bfq_bic_update_cgroup(struct bfq_io_cq *bic, struct bio *bio) {}

static void bfq_end_wr_async(struct bfq_data *bfqd)
{
	bfq_end_wr_async_queues(bfqd, bfqd->root_group);
}

static void bfq_disconnect_groups(struct bfq_data *bfqd)
{
	bfq_put_async_queues(bfqd, bfqd->root_group);
}

static struct bfq_group *bfq_find_alloc_group(struct bfq_data *bfqd,
					      struct blkcg *blkcg)
static struct bfq_group *bfq_find_set_group(struct bfq_data *bfqd,
					    struct blkcg *blkcg)
{
	return bfqd->root_group;
}

static struct bfq_group *bfq_create_group_hierarchy(struct bfq_data *bfqd, int node)
static struct bfq_group *bfqq_group(struct bfq_queue *bfqq)
{
	return bfqq->bfqd->root_group;
}

static struct bfq_group *
bfq_create_group_hierarchy(struct bfq_data *bfqd, int node)
{
	struct bfq_group *bfqg;
	int i;
3590	block/bfq-iosched.c (diff too large to show)
1497	block/bfq-sched.c (diff too large to show)
821	block/bfq.h (diff too large to show)