Tuning NetBSD VM behaviour (swap usage)

Index

Introduction
1. Terminology
2. Overview of memory usage
Controlling the big picture
Memory for a process
1. Per-process resource limits
2. Setting the per-process resource limits
Related issues
1. Tools
2. Tips
References

Preface / disclaimer

This document grew out of need to understand and control memory management of a NetBSD-current system. I am not a NetBSD developer, nor can I claim any sort of understanding of the system. I have tried to find the relevant information regarding the issues at hand, including manual pages, external documents such as web pages, mailing lists, and of course, reading the source code. However, the most influential source has been Charles Cranor's dissertation thesis "Design and Implementation of the UVM Virtual Memory System", and I highly recommend it to anyone interested in NetBSD's VM subsystem functions.

Due to the complex nature of the VM subsystem, and my greatly lacking knowledge of the data structures used throughout the code, the reader is encouraged to assume that this document is full of mistakes, errors and misleading information. All of them are the result of my inability to collect, understand and express the related information correctly. Furthermore, I have concentrated mostly on NetBSD-current, which is a moving target at best, so even if there were no factual errors in this document, it is still probably outdated by the time you read it. Some of the information presented here might be valid for both NetBSD 1.6 and 2.0 releases, but as I don't run them, I can't say for sure. Thus, the reader should consider this document merely as a starting point, and make his/her own background checks before following any suggestions given here.

Specifically, do not blame me for anything that you do!

1. Introduction

Once upon a time, I was looking at a running NetBSD-current system that was acting as a transparent web cache/proxy. It ran squid, and had 1 GB of RAM, mainly to allow the cache lots of fast storage space. However, there were two problems with the squid process: once it grew to about 350 MB size, the system started to move pages of memory to swap, and once the process grew to about 400 MB and forked, the whole system froze partially. Both events were something I wanted to avoid, and therefore I needed to control the memory usage of the system. Fairly soon after starting to look into the issues involved, I realized that the documentation was not answering the kind of questions I was asking. This document tries to fill a part of that gap, and hopefully help others who might have the same kind of questions that I had.

This document focuses mainly on trying to avoid swap. As was mentioned on tech-kern mailing list, one could just disable swap, but there may be situations where the sort of overflow buffer offered by swap is needed, even though the normal, steady state of the system can manage without it. And some systems just don't have enough memory, but one would still like to control what goes to swap, and what is kept in RAM. As time, interests and my understanding allows, I may increase the scope of this document to other areas of memory management. All actions should still remain in the range of offered tunable parameteres, ie. the most one might need to do is to compile a new, differently configured kernel. No source modifications are done, as that will probably just lead to major problems. If you want to do that, and know how to improve the system that way, then you definitely don't need this document.

In a NetBSD system, the memory is managed by the kernel, and used by the kernel and other running processes. Specifically, the memory is managed by the VM subsystem, and in NetBSD the latest implementation is known as UVM. Before one can efficiently control how the memory is used, one needs to know how the memory is being used by default. And to understand that, one needs to understand the terms used. There are various tools available for checking all sorts of things "under the hood", but most of the manual pages seem to cover just the technical aspects of running the program, and assume that the user is already familiar with the semantic aspects of it. For example, you could run 'top', and observe the memory usage shown at the top of its output. But do you know which numbers to add up to get the total memory usage? Why is inactive memory usage typically half of active memory? Why is there so much inactive memory? What is wired memory?

1.1. Terminology

In this chapter I will try to address the terminology involved. Most of the readers may well be familiar with them, but I wanted to cover them anyway, since they keep on showing up here and there, and getting them wrong is one way to get the whole thing wrong.

active memory: Part of virtual memory that is considered to be in active use, and so it is resident.
NetBSD tries to keep the ratio of active to inactive memory as 2:1
anonymous memory: In a UNIX-like operating system, almost everything is a file. If it is a file, then it has a name. If the contents of that file end up in virtual memory, then one could still consider that part of the memory to have a name: the file name. If one needs to use that part of the memory for something else, then nothing much is really lost, as the contents can always be found from the file. On the other hand, if the contents of some part of the memory did not come from a file, then that part has no name, it is anonymous.
In NetBSD the amount of anonymous memory kept resident is controlled by the sysctl variables vm.anonmin and vm.anonmax (and indirectly by vm.execmin, vm.execmax, vm.filemin, vm.filemax and vm.bufcache).
buffer cache: The old, traditional cache (from before UBC) for various things, such as file system metadata.
executable memory: Part of virtual memory that contains executable code, typically from a (program) file.
file cache (page cache): Part of virtual memory that caches contents of recently accessed files. Any file reads will cause memory consumption, and if the contents can not already be found from the file cache, then some memory will need to be allocated for them. Thus, file reads can cause file cache to grow, with the expense of anonymous and executable pages.
In NetBSD the amount of file cache used is controlled by the sysctl variables vm.filemin and vm.filemin (and indirectly by vm.anonmin, vm.anonmax, vm.execmin, vm.execmax, and vm.bufcache).
free memory: Part of virtual memory that is readily available to any entity requesting memory.
NetBSD tries to keep the amount of free memory relatively low, as memory not needed by processes can be used for caching purposes.
NetBSD tries to set the low water mark for free memory to somewhere between 16 kB and 256 kB, depending on available RAM. As soon as free memory falls below that amount, the system tries to free more. Pages will be freed until about five times the low water mark are free (there are some others reasons why it would stop before that, though).
inactive memory: Part of virtual memory that is still resident and has valid content, but is marked as not used. Since the content is valid, the page can be easily re-activated, if needed. On the other hand, if virtual memory is needed for something else, inactive memory can be freed relatively easily.
NetBSD tries to keep the ratio of active to inactive memory as 2:1
page: Unit of virtual memory. Typically it is around 4 kB in size.
In NetBSD you can check the page size with `sysctl hw.pagesize´
resident memory: Part of virtual memory that is kept in RAM.
swap: Typically a slow mass storage device (or devices) that holds memory pages that don't fit to RAM. When RAM fills up, some of the pages are moved (paged out) to swap, to make room for more urgently needed pages. When the paged out pages are needed, and there is room in RAM for them, they will be paged back in. In a low memory situation this may happen at the expense of some other pages moving temporarily to swap.
virtual memory: UNIX-like operating systems have their own memory space that is not limited to the amount of physical memory (RAM) present. Since there is no direct relationship between physical memory space and memory space offered by the OS, the OS memory is considered to be virtual. It is then the responsibility of the VM (virtual memory) subsystem to keep track of how the virtual addresses are mapped to physical ones, and what parts of the virtual memory space are kept at what parts of actual, physical memory, including swap (so this could include all sorts of memory spaces in addition to RAM).
wired memory: Part of virtual memory marked to always stay resident.
In NetBSD, the total amount of wired memory is limited to one third of the available RAM.
The amount of wired memory that a single process can have is controlled by the per-process resource limit "locked memory".

1.2. Overview of memory usage

First, it is a good idea to get an overview of what is currently going on in a running system. This will give us an idea of the available memory, and how it is being used. To do that, one could run `vmstat -s´, and then observe the results. The following shows some lines from the output:

  vmstat -s | head -22                COMMENTS
  ---------------------------------------------------------------------
     4096 bytes per page              page size                (4 kB)
       16 page colors
    47521 pages managed               available memory         (~185 MB)
      449 pages free                  free memory              (1796 kB)
    25006 pages active                active memory            (~98 MB)
    12668 pages inactive              inactive memory          (~50 MB)
        0 pages paging
      709 pages wired                 wired memory             (2836 kB)
        0 zero pages
        1 reserve pagedaemon pages
        5 reserve kernel pages
    52456 anonymous pages             anonymous memory         (~205 MB)
    29374 cached file pages           file cache               (~115 MB)
     3345 cached executable pages     executable pages         (~13 MB)
       64 minimum free pages          freemin                  (256 kB)
       85 target free pages           4/3 * freemin            (340 kB)
    12719 target inactive pages
    15840 maximum wired pages
        1 swap devices
    49139 swap pages                  swap available           (~192 MB)
    29857 swap pages in use           swap in use              (~117 MB)
  1977880 swap allocations

Compare the above with what top was showing some minutes later:
(output sorted by resident size; type 'ores<enter>' in top)

load averages:  0.38,  0.49,  0.60                                     12:38:47
101 processes: 100 sleeping, 1 on processor
CPU states:  0.0% user,  0.0% nice,  1.0% system,  0.0% interrupt, 99.0% idle
Memory: 99M Act, 50M Inact, 2732K Wired, 13M Exec, 114M File, 876K Free
Swap: 192M Total, 117M Used, 75M Free

  PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
   11 root      18    0     0K   26M syncer   335:28  0.00%  0.00% [ioflush]

We'll note that there is about 190 MB of RAM in the system, and it is not enough, since about 117 MB of memory has been paged out to swap. Furthermore, at the same time as slow swap is being used for processes, fast RAM is being used for file cache! If one could reduce the file cache usage down, while keeping more of the anonymous pages in RAM, the system might feel more responsive (assuming that file IO is not one of the main functions).

Note that there are more anonymous pages than available memory, so one can not avoid using swap, without somehow reducing the memory usage first. However, executable pages don't seem to take too much memory, but since they contain executable code, one would probably want to keep them in RAM.

The above example is a "desktop" system running as a small server with limited hardware. If that was a laptop, then a slow hard disk would make life miserable, as everything fetched from swap would take a long time compared to RAM.

To check the system buffer/cache usage/utilization, one can run `systat bufcache´ periodically. This will be especially useful once one starts to tune vm.bufcache and related parameters.

2. Controlling the big picture

There are two parts one needs to check and control to achieve the overall memory performance: resource limits for individual processes, and resource limits for the total use of available memory. We'll do a top down approach here, looking at the bigger picture first, and worrying about individual processes later. Note that these two are related, though. Parts of a process could end up in swap even if the resource limits for the totals (like vm.anonmin) seem to be OK (or vice versa: process limits are OK, but totals are still off causing unnecessary use of swap).

The main controls for memory usage in NetBSD are: vm.anonmin, vm.anonmax, vm.execmin, vm.execmax, vm.filemin, and vm.filemax. In addition to those, the buffer cache has its own controls, mainly vm.bufcache, vm.bufmem_lowater, and vm.bufmem_hiwater.

Defaults for these limits may well be reasonable for most systems. On the other hand, no default can be reasonable for all situations, so there could be cases where one needs to modify the limits. Before touching them, we need to cover the NetBSD page daemon, since it is the one using most of these limits, and also the one responsible for deciding which pages go to inactive list, and from inactive list to free list (and possibly paged out to swap in the process).

2.1. NetBSD page daemon

The NetBSD VM subsystem has several parts dealing with different kinds of situations that relate to memory management. Our interest focuses on the part that handles the situation when there is little free memory left. The main component is the page daemon, that normally sleeps while everything else is running smoothly. However, once the amount of free memory drops below a given level, the page daemon is awakened to fix the memory shortage. Simply put, once it has freed enough memory, it goes back to sleep again.

One of the main triggers for waking up page daemon is that free memory has dropped below the minimum allowed (freemin). Should the page daemon wake up for any reason, it will check if the amount of free memory is below a certain target value (freetarg), and if it is, it will start scanning the list of inactive pages to find pages it could free. After that, it will go through the active list, and deactivate some of the pages there, to keep the active to inactive memory ratio around 2:1.

The basic idea is that there are two main lists: active pages, and inactive pages. Pages from each of these lists may move to the end of the other list (ie. pages de-activated will go from active list to the end of inactive list, and vice versa). Wired pages are kept in a separate list, as active and inactive lists are used for deciding what pages to free (and possibly page out to swap), and wired pages should be kept resident. All these lists are ordered (linked) lists, where pages are added to the end, and lists are travelled starting from the first. The page daemon will start by going through the inactive list, starting from the first page (which is the least recently added page in the list), and if possible, free the page. The contents of the page might need to be paged out to swap. Once enough pages have been freed, the page daemon will calculate a new target for the amount of inactive pages (to achieve that 2:1 ratio). It then goes through the active list, and marks pages as inactive until the target amount is reached. Finally, it will try to remove a small amount of bytes from buffer cache (roughly one third of freemin size, or about 5 to 100 kB's worth depending on RAM size). Page daemon is now done.

The above is a simplified view, but it should be in enough detail to understand the memory managment for our purposes. Assume that pages from the inactive list may be re-activated by other parts of the VM subsystem, eg. if their contents are needed. So, presumably useful stuff in the inactive list will not stay there long enough to be considered by the page daemon. Therefore, without any additional limits, the logic of page daemon follows a simple Least Recently Used algorithm, where the least recently used pages will be freed. First of all, the least recently used pages in active list go to the inactive list. And secondly, the least recently used pages in the inactive list will be freed. However, simply using the LRU type page selection for freeing pages would result in useful pages ending up in swap, eventually. To offer better control over the page daemon, some additional limits have been introduced.

2.2. Meaning of vm.{anon,exec,file}{min,max} limits

There are six parameters for controlling what sort of pages should (not) be freed. For each of the three types of pages (anonymous, executable, file cache) there is a lower limit and an upper limit. According to manual pages (see `man 3 sysctl´, and look for VM_ANONMAX), these are percentages of physical memory, but currently this is very different from how they are actually used. For the purpose of understanding how these limits work, though, it is not relevant whether they are percentages of A or B. Of course, when actually changing these values, realities need to be taken into account.

Each lower limit defines how many pages of that particular type should be kept active. In other words, when page daemon is going through the inactive list and encounters a page for which the total usage is below the given lower limit, then instead of wanting to free the page, it will re-activate that page, and move forward on the list. So, the total of all three lower limits should be less than the available memory left after kernel's own usage (if the percentages were based on total RAM, and not the current "active + inactive + free" sum).

The upper limits are not as simple as lower limits. Instead of considering any single page type (and the related limits), lets look at all three. Should the usage of all three types of pages fall below their respective lower limits, then they will all be re-activated when encountered on inactive list, as explained above. Should they all fall below their respective upper limits, then only those types of pages will be considered for freeing that also happen to be above their lower limit. If any type of page usage is above the respective upper limit, then only those types of pages are considered for freeing, and others will be re-activated. See table below:

anon/exec/file usage what gets re-activated (avoids swap)

all below lower limits all re-activated

all below upper limits re-activate those below lower limit

any above upper limits re-activate those below upper limit

anon/exec/file usage	what gets re-activated (avoids swap)
all below lower limits	all re-activated
all below upper limits	re-activate those below lower limit
any above upper limits	re-activate those below upper limit

There are some additional things worth noting. The upper limits can be reached, and "broken". If there is enough free memory for anybody wanting to use it, the usage can keep on growing as long as supplies last. After all, page daemon is sleeping. Also, once the daemon wakes up, it may happen that pages of one type (over the limit) are at the beginning of the list, and some other type of pages (over the limit) are at the end. So, only some of the different types of pages over their respective limits might be freed, depending on how they are located in the inactive list. Just as with lower limits, the upper limits should be low enough to guarantee that once all available memory is used, at least one of the page types is above its upper limit.

2.3. Setting the vm.{anon,exec,file}{min,max} limits

Now we are able to determine the "proper" limits for anonymous, executable, and file cache type pages. First of all, in NetBSD-current, the percentages are not of physical memory, but rather from the sum of "active + inactive + free". This means that the actual limits could be over 60% lower than what they would, based on physical memory. Since both wired memory and kernel's own memory are excluded, the limits are difficult to predict. Both wired and kernel memory are also dynamic, meaning that the limits will change as wired/kernel memory usage changes.

After looking at the top and vmstat outputs, one should calculate the percentages of anonymous, executable and file cache pages, based on the known "active + inactive + free" base line. Lets assume the following scenario (which is pretty much what I had):

total RAM is 1 GB
active memory is 560 MB
inactive memory is 280 MB
wired memory is 4 MB
executable memory is 8 MB
file cache is 440 MB
free memory is 1 MB
anonymous memory is 535 MB
only one large process in memory, using ~400 MB
the large process forks from time to time

Note that I will be dealing with a steady state, where the available RAM is sufficient for normal operations. The same may apply to desktops, laptops and servers, but if you are dealing with insufficient hardware, and constantly changing operating modes, my simple rules/logic may not apply.

Instead of just showing some limits that I chose for this scenario, I'll try to give the reasoning behind them. Hopefully, it can be expanded to other, different kinds of scenarios. What I had to consider here was how much of the RAM I was going to be allocating to any particular type of memory. The OS would be using "active + inactive + free" as a base line, so in my case that would be about 700..900 MB. All vm.{anon,exec,file}{min,max} values would be percentages of that. I could have chosen to use the lowest number (to get the highest percentages), but I chose a more often occuring average of ~800 MB as a base line.

If there is enough RAM for normal, steady state, then I think file cache is the way to control the rest. Ideally, the free space between what processes require and what is available would be managed by the kernel so that file cache uses most it, and should some transient processes need memory, it would be reclaimed from file cache. vm.filemax would be as low as possible, probably 1 percentage point above vm.filemin (or even the same). I have not figured out why the upper limit should be much higher than the lower limit; UVM will give file cache more memory, if possible. Hopefully, a low vm.filemax will make file cache to "always" be above the upper limit, forcing anonymous and executable pages to get re-activated as long as they remain below their respective upper limits.

If the normal operation required lots of file reads (NFS server perhaps), one could set vm.filemin and vm.filemax to resonably high values, leaving anonymous and executable limits as low as possible, without them actually going above their respective upper limits. This would leave most of the memory to file cache and kernel, while still allowing the limited number of processes to keep running in resident memory. vm.filemax would still be set so that file cache is above the upper limit, and thus it keeps on re-using older parts of itself for more recent file accesses.

Once the file cache limits are set, one can decide for the executable and anonymous pages. Since executable pages contain running code, that would be my next concern. I've tried to set vm.execmin a bit higher than where the steady state usage runs, and then based on how much the usage varies over time, set the vm.execmax accordingly.

Finally, I've tried to offer anonymous pages as much memory as possible. Again, I've tried to estimate how high the usage can go, and then guarantee as much of it as possible by setting vm.anonmin. vm.anonmax was then set so that the sum of all max values was about 90-95%. As long as the percentages are based on a number that excludes kernel memory usages (and possibly others), then the sum could well be 100%, or even more.

So, coming back to the example scenario. I know that I have memory to spare, so I let UVM play with buffer cache, file cache, etc, and set vm.filemin to 0, and vm.filemax to 1. That pretty much guarantees that file cache is always above the upper limit (which is now 1% of ~800 MB, or ~8 MB). Current executable page usage is ~1% (8 MB of ~800 MB), so I've set vm.execmin to be 2%, and vm.execmax as 4%. The upper limit can be too small, if one suddenly starts to run large programs, but in my scenario that 4% is plenty. Finally, I know that I want a lot of anonymous memory for that large process. 400 MB would be ~50%, so I set vm.anonmin to 70%, and raised vm.anonmax as high as I could. Setting it to 95% means that the sum of my upper limits is 100%. I could have kept the sum around 95%, in case the underlying calculations change in the OS. Of course, one should also take buffer cache usage into consideration.

One can check the limits with `sysctl vm´:
(only partial output shown here)

vm.anonmin = 70
vm.execmin = 2
vm.filemin = 0
vm.anonmax = 95
vm.execmax = 4
vm.filemax = 1

Note that this is not from the same system as examples given in 1.2. Any single limit could be set with eg. `sysctl -w vm.filemax=1´. To make the changes permanent, they can be added to /etc/sysctl.conf:

# Tune VM usage
vm.anonmin=70
vm.anonmax=95
vm.execmin=2
vm.execmax=4
vm.filemin=0
vm.filemax=1

The following table tries to list some of the typical scenarios, and what one might want to consider when adjusting the anon/exec/file limits for them. Some of the scenarios also involve high network traffic, which may have additional memory requirements.

typical use things to consider

desktop workstation web browsers, office software use a fair amount of anon/exec;
reduce file cache to useful minimum, give the rest to apps;
you'll probably see by the end of the day how the memory gets used
file server lots of file IO, consider higher buffer & file cache;
reduce anon/exec usage to minimum neccessary
web server/cache lots of file IO, consider higher buffer & file cache;
may differ from typical file server due to higher anon/exec usage;
apps may try to cache file content in anonymous memory, making file cache less useful (and possibly wasting memory), so you need to check where and how the caching takes place

typical use	things to consider
desktop workstation	web browsers, office software use a fair amount of anon/exec; reduce file cache to useful minimum, give the rest to apps; you'll probably see by the end of the day how the memory gets used
file server	lots of file IO, consider higher buffer & file cache; reduce anon/exec usage to minimum neccessary
web server/cache	lots of file IO, consider higher buffer & file cache; may differ from typical file server due to higher anon/exec usage; apps may try to cache file content in anonymous memory, making file cache less useful (and possibly wasting memory), so you need to check where and how the caching takes place

2.4. Buffer cache (vm.bufcache)

In order to avoid caching data in multiple places (and wasting memory by doing so), a new caching/buffering scheme was introduced into NetBSD soon after UVM. This system, called UBC, takes advantage of UVM's new features to achieve efficient data caching inside the kernel. It is responsible for the file cache now, among other things. The old, traditional way of caching data is still used for file system metadata, and in late 2003 it was modified to dynamically allocate memory, instead of making a static reservation.

There are several ways to control the buffer cache's memory usage: two kernel config options and three sysctl variables. All of these control two limits: minimum and maximum memory usage of the buffer cache. As with the anon/exec/file limits, the lower limit is "guaranteed", whereas the upper limit can be exceeded.

Kernel config option BUFCACHE is defined to be a percentage of total available RAM. It specifies the upper limit for buffer cache, and the lower limit is then derived from the upper limit: lower limit is 1/8 of the upper limit, or at least 64 kB. However, BUFCACHE (and the corresponding vm.bufcache) is restricted to 5%..95% range. In a large memory system 5% can be too much. The default is 15%.

To allow for better control over the BUFCACHE, another kernel config option was added: BUFPAGES. It controls the same upper limit as BUFCACHE, but without the 5..95 range limit, and it is expressed in units of pages. Again, the lower limit is derived from this value by dividing it with 8.

The buffer cache limits can also be changed at runtime. Sysctl variable vm.bufcache can be used just like the BUFCACHE kernel config option. And finally, both the lower and upper limit can be individually set with byte resolution using vm.bufmem_lowater and vm.bufmem_hiwater. The current usage can be checked with `sysctl vm.bufmem´. Note that when changing the limits, the order is important, as some internal checks are made to ensure that there is at least 16 bytes between the two limits. Therefore, you should raise the hiwater mark first, when raising the limits from current ones, and lower the lowater mark first when lowering the usage from current situation.

There are some things worth noting with regard to buffer cache usage. First of all, all of the memory the buffer cache is using reduces the limits set by vm.anonmin and friends. That is, the percentages of anon/exec/file limits are calculated from a sum that gets smaller as buffer cache usage increases (and vice versa). Secondly, when the system is low on free memory and page daemon starts to free pages, buffer cache memory is not immediately available, even if the actual usage is above the hiwater mark. This means that even though buffer cache may not be as aggressive to use available memory, once it gets it, it will not give it up as easily as file cache would.

As I don't yet feel that I have a good enough understanding of the buffer cache, I have not tried to tune it much. First, I set the vm.bufmem_lowater and vm.bufmem_hiwater to some reasonable values, keeping the upper limit as eight times the lower limit (as I don't know whether there are design issues behind the ratio). I've also kept the numbers as multiples of page size, just in case. Once the stable state of the system seemed reasonable, I set the kernel config option BUFPAGES to match the runtime value of vm.bufmem_hiwater (converted to units of pages, ie. divided by hw.pagesize), compiled a new kernel and booted it up. In a large memory system vm.bufcache was too coarse for my liking, so I haven't really used it much after initially setting it to 5% (and realizing the limits were still too large and not even hard limits).

Note that official NetBSD documentation recommends not setting any of the BUFCACHE, NBUF, or BUFPAGES kernel config options.

XXX: It would be nice to know some real world numbers regarding buffer cache size (in relation to system usage).

2.5. page daemon statistics

The page daemon keeps some statistics about its actions. These can be seen with `vmstat -s´:
(only partial output shown)

   239922 times daemon wokeup
     7962 revolutions of the clock hand
     7952 times daemon attempted swapout
  2195574 pages freed by daemon
  7058156 pages scanned by daemon
        0 anonymous pages scanned by daemon
  1969498 object pages scanned by daemon
  2547860 pages reactivated
       36 pages found busy by daemon
        0 total pending pageouts
  7608253 pages deactivated

First thing we notice here is that the daemon gets poked a lot: some 240,000 times (for a system that had been up a bit over eight days). However, of all those times, it only needed to start scanning the active and inactive page lists less than 8,000 times (clock hand revs), making it about 1,000 times per 24 hours, or once every ~90 seconds, on average. During those ~8,000 times, it scanned some 7,060,000 pages on the inactive list, or an average of about 900 pages per scan. Of all those scanned inactive pages, it freed about 2,200,000, or roughly every third page. No anonymous pages were considered, meaning all anon pages on inactive list get re-activated, meaning that so far the vm{anon,exec,file}{min,max} limits have been effective. Since pages could get re-activated by other means besides the page daemon, we can't say for sure how many of the re-activations were done by the daemon. The same holds for page de-activations, although most of the time these counters probably get incremented by the page daemon. You'll note that if you do: 7058156 - 2195574 - 2547860 - 1969498 you get 345224, which is far from zero. I don't yet know why this is so. One explanation might be that the actual re-activations of anon/exec/file pages are not shown by vmstat (they seem to have their own counters), and the re-activation counter shows just those pages that were referenced by some other entity, making them poor candidates for freeing. (XXX: ???)

The above example comes from the 1GB system running squid in a steady state. Most of the memory pressure comes from constant file reads by squid (accessing the disk cache). On my desktop system with memory to spare, and a very steady, low workload, the page daemon hardly ever wakes up, and almost the only reason for doing so is to start scanning the page lists. On average, the page daemon scans pages every 20 minutes. A one terabyte NFS server with less than perfectly tuned limits scans the page lists on average every six seconds, and that system is not heavily used.

3. Memory for a process

In NetBSD each process has its own memory space. This memory is initialized by the VM subsystem, and just as there are limits for the total memory usage for the whole system, there are limits for individual processes. These limits affect how the memory space for the process is allocated, and an overview looks something like this:

<CODE/TEXT>:
executable code of the program

<DATA>:
initialized variable data of the process

<BSS>:
un-initialized variable data

<unallocated>:
shared libraries, mmap'd file data

<STACK>:
program stack, local variables

3.1. Per-process resource limits

To protect the system from resource exhaustion, each process has a set of resource limits it can't exceed. If these limits are properly set, then all processes are able to use enough resources to get their job done, but no process can cause the others to fail due to some resource running out. To allow for some flexibility, and for the process to know when it is getting close to the maximum usage, the limits have been divided into soft and hard limits. When the soft limits are reached, the process is notified through a signal. The pre-process resource limits include CPU time, file descriptor usage, memory usage, and others.

Each process inherits the resource limits from its parent, and typically, the per-process limits for the shell can be read or set using a shell built-in command such as `limit´. The exact syntax depends on the shell used, so check your shell's documentation. From the shell, the limits get passed on to all processes started from the shell. Here, we are interested in the memory limits, so lets start by looking at what tcsh's `limit´ shows us after having run `unlimit´:
(only partial output shown)

datasize        1048576 kbytes
stacksize       32768 kbytes
memoryuse       508136 kbytes
memorylocked    508136 kbytes

As one can guess, the datasize refers to the DATA segment of the process' memory space. Here, 'memorylocked' limits how much wired memory the process may have, and memoryuse refers to the maximum resident size of the process. See `man 3 sysctl´ and look for PROC_PID_LIMIT for additional information on these and other limits.

Note that there are in fact two sets of limits: soft and hard limits. When the process reaches the soft limit, it gets notified through a signal, but can still continue using additional resources until the hard limit is reached. What happens when the hard limits are reached depends on how the process handles such situations (if at all). For the purposes of this document, we can assume the soft and hard limits to be the same, as we are merely interested in the total memory usage (as oppposed to fine tuning the specific behaviour of any particular process).

As was mentioned earlier, the VM subsystem uses some of the resource limits when initializing the process' memory space. The memory space allocation (or map) of any process can be seen with `pmap -p pid´. To look at the kernel's map, use PID 0. By default, pmap will show its parent's map, so this is what it looks like for the shell I used for showing the memory specific resource limits above:

08048000    264K read/exec         /usr/pkg/bin/tcsh
0808A000     16K read/write        /usr/pkg/bin/tcsh
0808E000    972K read/write          [ anon ]
4808A000     40K read/exec         /libexec/ld.elf_so
48094000      4K read/write        /libexec/ld.elf_so
48095000      4K read/write          [ anon ]
48096000      4K read/exec           [ uvm_aobj ]
48097000     32K read/write          [ anon ]
4809F000      8K read/exec         /lib/libtermcap.so.0.5
480A1000      4K read/write        /lib/libtermcap.so.0.5
480A2000     20K read/exec         /lib/libcrypt.so.0.1
480A7000      4K read/write        /lib/libcrypt.so.0.1
480A8000     12K read/write          [ anon ]
480AB000    656K read/exec         /lib/libc.so.12.123
4814F000     28K read/write        /lib/libc.so.12.123
48156000     52K read/write          [ anon ]
BDC00000  30720K                     [ stack ]
BFA00000   1920K read/write          [ stack ]
BFBE0000     64K read/write          [ stack ]
BFBF0000     64K read/write          [ stack ]
 total     4168K

Note how this matches quite nicely with the overview shown in 3. If you look at the hexadecimal addresses on left, you'll see that there is some anonymous memory starting at 0808E000, and then some shared libraries at 4808A000. Subtracting these, we get roughly 40000000, which translates to about 4*16^7 bytes, or about 1048576 kbytes. This corresponds with the datasize hard limit of the process. Similarly, there is already space allocated for the maximum stack size.

Furthermore, we can see the amount anonymous memory used by the process. As with the shared libraries, we can not assume that this is all exclusively used by this particular process, altough most of the anonymous memory probably is not shared by others.

3.2. Setting the per-process resource limits

As has been mentioned in 3.1, the resource limits for the shell can be set using the shell's built-in commands. This way, one can set limits dynamically for the processes one is about to start from the running session. The same approach also works for modifying startup scripts (though one needs to be careful not to set the limits in a file that gets 'sourced' by some permanent process, thereby setting the limits the all the other processes started by that same 'master' process). Note that only root can increase the hard limits, other users can only decrease them from the ones in use.

Another way to change the per-process resource limits is through sysctl variables. Each process has a number of variables, that can be read or set with sysctl, and their names start with proc.PID.rlimit. Here are the variables for the same shell as before, but this time without forcing the soft limits to match hard limits. Since the PID of the running shell was '27390', I've used the command `sysctl proc.27390.rlimit´:
(only partial output shown)

proc.27390.rlimit.datasize.soft = 134217728
proc.27390.rlimit.datasize.hard = 1073741824
proc.27390.rlimit.stacksize.soft = 2097152
proc.27390.rlimit.stacksize.hard = 33554432
proc.27390.rlimit.memoryuse.soft = 520331264
proc.27390.rlimit.memoryuse.hard = 520331264
proc.27390.rlimit.memorylocked.soft = 173443754
proc.27390.rlimit.memorylocked.hard = 520331264

You could use sysctl to set the limits for a running process that for some reason does not have the wanted limits set. Setting them in /etc/sysctl.conf is not a good idea, since one does not know the correct PID at that time. There is yet another place where one could set per-process resource limits: /etc/login.conf. These limits affect login processes, eg. users logging into the system. You might want to consider setting their limits a lot lower than for some system specific process co-existing with users' processes. The exact limits depend on available memory, number of users, number of processes, and the needs of those users.

Here is an example of what the /etc/login.conf might look like:
(You probably do not want to use these specific values)

default|Default limits:\
        :datasize=48M:\
        :memoryuse=32M:\
        :memorylocked=4M:\
        :stacksize=8M:

Finally, one could set the default limits in the kernel config, and then compile a custom kernel. The following kernel config options could be used to set the per-process limits in kernel:

	options MAXTSIZ
	options DFLDSIZ
	options MAXDSIZ
	options DFLSSIZ
	options MAXSSIZ

4. Related issues

4.1. Tools

Here I've tried to collect some (hopefully useful) tools that may help in understanding the system behaviour.

cache usage: To check the file and buffer cache usage one could use `systat bufcache´.

pkgsrc/sysutils/xuvmstat: Originally written along with the UVM code, xuvmstat was used to watch VM counters in real time in a X window.

4.2. Tips

Here I've tried to collect some (hopefully useful) tips, that are related to VM tuning, but are not neccessary.

removing swap: When starting to tune the VM limits, one may already be using swap. After setting new limits, it would be nice to start from "scratch" without booting, ie. get all pages back to RAM. I've used `swapctl -U´ and `swapctl -A´ to disable and enable swap, respectively. Note that I don't know what disaster may take place if there is not enough RAM when disabling swap.

5. References

There are lots of reference material for NetBSD. The following is a very limited subset that I have used while trying to find information for this web page. This is by no means complete, and a lot of the content has evolved (and continues to evolve) through time (eg. manual pages).

I also want to thank all NetBSD developers and users, who offered their wisdom (either privately or on mailing lists) while I was writing this page. Without their help, this page would have been an embarrasing failure. Thank you!

NetBSD documentation:
- Kernel FAQ
- Tuning NetBSD
- UVM FAQ
NetBSD mailing lists:
- current-users
- tech-kern
NetBSD manual pages:
- login.conf(5)
- options(4)
- pmap(1)
- sysctl(3)
- sysctl(8)
- sysctl(9)
- systat(1)
- vmstat(1)
NetBSD source code:
- src/sys/kern/vfs_bio.c
- src/sys/uvm/uvm_pdaemon.c

URL for this document is http://www.selonen.org/arto/netbsd/vm_tune.html
Last modified on Nov 18, 2004 by Arto Selonen