Self-Tuning Systems Software
Joseph S. Barrera III
One Microsoft Way
Redmond, WA 98052-6399
Systems software that tunes and reconfigures itself is both feasible and increasingly necessary. The necessity of self-tuning systems arises from the increased complexity of systems software combined with the broader audience for such systems. We intend to demonstrate the feasibility of self-tuning systems by constructing one based on an architecture that separates the tasks of defining expectations, measuring actual performance, analyzing measurements against expectations, and performing actions in response to that analysis, actions that can range from gathering more data to reconfiguring major components of the system. This separation of responsibilities allows systems components to concentrate on performing well within narrower bands of operation, leaving the analysis agent to make more global and longer-term decisions about the best operating parameters and component implementations to use.
The need for self-tuning systems software arises from the increasing complexity of today's systems combined with the decreasing fraction of users willing or able to tune such systems themselves. Even personal computers purchased solely for word processing come with systems software that provides paging, scheduling, window management, device management, and filesystem caching, all of which can be tuned for better performance. However, most consumers of such systems either have no interest in doing this tuning, or don't even know that such tuning is possible. More sophisticated systems (e.g., workstations or networked personal computers) add multiple network protocols, multiple filesystems, and increasingly, multiple APIs (e.g., some combination of POSIX, DOS, Windows, and OS/2). Tuning such systems is even more complicated, given that the usage of the various subsystems varies wildly based on the application mix, and since any single action (e.g., opening a document) may require services from many subsystems. At the same time, such tuning becomes more critical, since quick startup of applications and good overall performance may depend on observations and knowledge which extend far beyond the window of time used by, say, the virtual memory system's page replacement algorithms.
Our belief in the feasibility of a self-tuning system, as well as our proposed architecture, is based on our experience with manual systems configuration and optimization as a largely mechanical process. Generally, one starts off with a high-level operation which either takes longer than expected or which one wishes to optimize. One then examines its subcomponents recursively until one finds subcomponents which are operating outside of their designed limits or which are failing to meet performance expectations despite being operated within their designed limits. Finally, one either adjusts the parameters of the problem subcomponents to better match their operating conditions (e.g., increasing a file server's cache size), or replaces a subcomponent's implementation with another. (Sometimes this replacement requires actually writing the new implementation, which of course is not a mechanical process, at least not yet.)
To automate this process, one must identify the various pieces of knowledge used and attempt to codify them. We see the important pieces of knowledge as being: what components are in the system; which components use which other components (i.e., what is the dependence hierarchy); what interface each subcomponent exports; what parameters (such as cache size) can be adjusted for the subcomponent; and what the expected performance and usage of the subcomponent are, described in terms of the interface and the adjustable parameters. This information should be present both for the subcomponent implementations currently configured and for alternative implementations available for reconfiguration.
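As a purely illustrative sketch, the pieces of knowledge enumerated above could be codified as records in a component database. The record layout and field names here are our assumptions, not a proposed interface:

```python
# Hypothetical component-database record capturing: identity, dependence
# hierarchy, exported interface, tunable parameters, and expectations
# stated in terms of the interface and parameters.
from dataclasses import dataclass

@dataclass
class ComponentRecord:
    name: str          # which component this record describes
    uses: list         # dependence hierarchy: components it relies on
    interface: list    # names of exported operations
    parameters: dict   # adjustable knobs, e.g. cache sizes
    expectations: dict # expected performance/usage per exported operation

file_server = ComponentRecord(
    name="file_server",
    uses=["virtual_memory", "disk_driver"],
    interface=["open", "read", "write", "close"],
    parameters={"cache_size_kb": 512, "max_cached_file_kb": 1024},
    expectations={"read": {"calls_per_min": 600, "usec_per_call": 250}},
)
```

The same schema would describe alternative implementations available for reconfiguration, not just the implementation currently configured.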
We believe that self-tuning should be divided into four components: expectations, measurement, analysis, and actions. A set of expectations describes how the system and its subcomponents should behave under various conditions. Measurement gathers data about actual conditions and behavior; subsequent analysis then determines whether the expectations are being met, and which actions should be performed as a result; actions can range from gathering more data to performing dynamic reconfiguration.
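One way to picture this division of labor is as a feedback loop. The sketch below is a toy, assuming dictionary-valued measurements and a trivial analysis rule; it only illustrates how the four pieces connect:

```python
def tuning_cycle(expectations, measure, analyze, act):
    """One pass of the self-tuning loop: gather measurements, analyze
    them against expectations, then perform the resulting actions."""
    observed = measure()                         # measurement
    findings = analyze(expectations, observed)   # analysis vs. expectations
    for action in findings:                      # actions: more data, retuning, ...
        act(action)
    return findings

# Toy usage: one expectation, one measurement, one trivial analysis rule.
expectations = {"read_usec": 250}
measurements = {"read_usec": 900}
log = []
findings = tuning_cycle(
    expectations,
    measure=lambda: measurements,
    analyze=lambda e, o: ["gather_more_data"] if o["read_usec"] > e["read_usec"] else [],
    act=log.append,
)
```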
We furthermore believe that analysis and expectations should be concentrated in a single agent and database, respectively, and that the only self-awareness task performed by actual systems components should be measurement of conditions and resulting behavior. Concentrating analysis in a single agent allows that agent to make global decisions with information from all systems components while simplifying the implementation of those components. Of course, we do not preclude all adjustment of subcomponents to changing inputs; thus we would expect a virtual memory system to increase its pageout rate to match an increased fault rate. However, we would want to eliminate code in the virtual memory system that attempts to guess when pages belong to large files and subsequently treats such pages differently; instead, the analysis agent should detect when it is inappropriate for the file server to cache large files in virtual memory and tell the file server so directly.
Each component in the system would have expectations of use and resulting performance stored in a component database, along with the other information described in the Feasibility section above. These expectations could be determined automatically or stipulated by the component's developer. In either case, a side effect of the system's self-analysis is to reveal any discrepancies between the developer's intentions and the actual code.
Performance and usage data could be determined automatically by either measuring an actual system or (in the case of performance expectations) performing an automated analysis of the algorithms used by the implementation. During development, such automatically generated data could be used to discover relatively untested or poorly performing parts of the system. After a system has been released, such data could identify when a system was being used outside of designed or tested limits.
Alternatively, stipulated (programmer-provided) performance and usage data could serve as performance assertions, particularly during development when it is not clear which parts of the system should be optimized or what the actual usage of the system will be. For example, a simple root name server may expect to be called frequently but to hold only three or four names, and thus may use a pure linear scheme for best expected performance. If in practice the root name server is actually used for many more names, then a more complicated but more scalable implementation would be indicated. Conversely, expected usage information for heavily optimized common cases could identify cases that turned out in fact not to be very common, allowing such code to be replaced with smaller, simpler, or machine-independent versions.
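The root name server example can be phrased as a usage assertion checked by the analysis agent. The thresholds and recommendation strings below are invented for illustration:

```python
# Hypothetical usage assertion for the root name server described above:
# the pure linear-scan implementation is stipulated to hold only a
# handful of names; exceeding that indicates a replacement.
def check_name_server_usage(actual_name_count, expected_max=4):
    """Return a recommendation when stipulated usage is exceeded."""
    if actual_name_count > expected_max:
        return "replace: linear scan -> more scalable implementation"
    return "keep: linear scan is best for the expected handful of names"

check_name_server_usage(3)    # within stipulated usage: keep linear scan
check_name_server_usage(200)  # stipulation violated: recommend replacement
```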
Each systems component would perform measurement of its inputs and activities; however, the decisions as to what exactly to measure (out of a larger, fixed set of possibilities) would be made by the analysis agent. Measured data would be entered into a database which could subsequently be examined by the analysis agent as needed. Possible options for reducing the overhead of such dynamically selectable measurement include dynamically linked measurement code, downloadable measurement preprocessing, and database entry without context switching. Note that dynamically linked measurement code could in principle expand the set of things that could be measured in a systems component; the hard part, however, would be describing this expanded set. Allowing measurement of anything beyond call counts and parameters of exported methods would require a description of the component's internals, which would seem both difficult and undesirable.
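Dynamically selectable measurement, restricted as above to call counts and parameters of exported methods, might look like the following sketch. The probe registry and wrapper are our assumptions, standing in for mechanisms such as dynamically linked measurement code:

```python
import functools

class Probes:
    """Measurement limited to exported methods: the analysis agent
    enables probes from a fixed set, and only enabled probes record."""
    def __init__(self):
        self.enabled = set()   # probe names the agent has turned on
        self.records = []      # (method name, parameters) tuples

    def instrument(self, name, fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if name in self.enabled:          # measure only what the agent asked for
                self.records.append((name, args))
            return fn(*args, **kwargs)
        return wrapper

probes = Probes()
read = probes.instrument("read", lambda path: len(path))  # stand-in exported method

read("/etc/motd")           # probe disabled: nothing recorded
probes.enabled.add("read")  # analysis agent turns the probe on
read("/etc/passwd")         # probe enabled: name and parameters recorded
```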
The analysis agent decides what should be measured, compares these measurements with its expectations, and initiates subsequent actions ranging from reconfiguration to expanding the scope of what is to be measured. The analysis agent uses a hierarchical scheme to limit the amount of measurement and analysis required: only if higher-level operations fail to meet expectations are lower-level operations measured and analyzed. A key part of the agent's task is determining how much analysis it should do, balancing the cost of its analysis against its likely benefits, estimated from the gap between expectations and reality for the topmost expectations in the hierarchy. While most of its analysis is driven by its own observations, it should also be able to handle external queries (perhaps including "why did everything get so slow all of a sudden?"). Finally, since all of its knowledge of the system is provided by the configuration database, the analysis agent is not system specific and can easily accommodate new subsystems.
The analysis agent is careful to keep the benefits of its analysis from being negated by the cost of analysis. It limits the total amount of analysis by using a hierarchical scheme of expectations; if higher-level expectations are being met, then lower-level expectations are not tested. For example, if average file read performance is reasonable, then there is no reason to carefully examine file caching efficiency. Even when higher-level expectations are not met, the gap must be sufficient to be worth the expected cost of analysis, and must not be dominated by other failed expectations that should be investigated first. Finally, repeated failures of expectations should be noted (such as the virtual memory system's inability to keep paging down to a reasonable limit, probably due to insufficient physical memory), so that the same analysis is not repeated over and over. Note that this last point suggests two parallel sets of expectations, one for the ideal case and one for reality.
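The hierarchical gating just described can be sketched as a recursive check that descends to a subcomponent's expectations only when its parent misses by more than a minimum gap. The tree shape, field names, and 10% threshold are invented for illustration:

```python
def analyze(node, observed, min_gap=0.10):
    """Examine a node's children only if the node misses its expectation
    by enough to justify deeper analysis. Nodes are dicts of the form
    {"name", "expected", "children"}; `observed` maps name -> measurement."""
    gap = (observed[node["name"]] - node["expected"]) / node["expected"]
    if gap <= min_gap:
        return []                          # expectation met: stop here
    suspects = [node["name"]]              # worth investigating further
    for child in node.get("children", []):
        suspects += analyze(child, observed, min_gap)
    return suspects

tree = {"name": "file_read", "expected": 250, "children": [
    {"name": "cache_lookup", "expected": 20, "children": []},
    {"name": "disk_io", "expected": 200, "children": []},
]}
observed = {"file_read": 900, "cache_lookup": 21, "disk_io": 800}
# file_read misses badly, so its children are examined; cache_lookup is
# within expectations and is skipped, leaving disk_io as the suspect.
suspects = analyze(tree, observed)
```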
Another possibility for reducing the impact of analysis cost is doing it during times of reduced load. Thus if loading a particular application makes the system comatose for a couple of minutes, the analysis agent shouldn't make the situation worse; at most it should enable more measurement, but it shouldn't spend resources analyzing what happened until the peak load passes. Of course, if the peak never passes, then the analysis agent should start figuring out what's wrong. Determining the correct meta-policy for analysis clearly requires experimentation.
A key advantage of doing analysis in a single agent, as opposed to separately in each component, is that the single agent can select the most appropriate components to adjust or replace to remedy a problem, instead of each component trying to do something about it. For example, if the file system, as a result of caching large files in virtual memory, is causing excessive paging, the virtual memory system does not have to identify large objects as a problem and then attempt to treat them differently (e.g., flushing them out of physical memory sooner); instead, the analysis agent can detect the problem and subsequently inform the file system to lower its limit on the size of files that it caches in virtual memory.
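The large-file example above might play out as a single parameter adjustment pushed by the agent to the file system, leaving the virtual memory system untouched. The rule, threshold, and parameter name are our assumptions:

```python
# Hypothetical remedy: when paging is excessive, the analysis agent
# lowers the file server's limit on the size of files it caches in
# virtual memory, rather than having the VM system guess about large files.
def remedy_excessive_paging(pageouts_per_sec, file_server_params, threshold=50):
    if pageouts_per_sec > threshold:
        file_server_params["max_cached_file_kb"] //= 2
        return "lowered max_cached_file_kb"
    return "no action"

params = {"max_cached_file_kb": 4096}
result = remedy_excessive_paging(120, params)  # paging is excessive: limit is halved
```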
Another key advantage of doing analysis in a single agent is that the agent can afford a longer memory than can individual components. Thus while the file system's memory generally extends as far as its LRU queue, and at best as far back as last system bootup, the analysis agent can record its observations over months or years, allowing it to notice, for example, which files are always read whenever a particular application is started for the first time since bootup.
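The long-memory example can be sketched as an intersection over startup traces recorded across many boots; files read on every first startup of an application are prefetch candidates. The trace data here is hypothetical:

```python
def always_read_at_startup(traces):
    """Files present in every recorded first-startup trace for an
    application; these are candidates for prefetching."""
    sets = [set(trace) for trace in traces]
    return sorted(set.intersection(*sets))

# Hypothetical traces from three boots; tmp123 and cache.bin vary by run.
traces = [
    ["app.exe", "fonts.dat", "prefs.ini", "tmp123"],
    ["app.exe", "fonts.dat", "prefs.ini"],
    ["app.exe", "fonts.dat", "prefs.ini", "cache.bin"],
]
prefetch = always_read_at_startup(traces)
```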
Once the analysis agent has detected a failed expectation or useful pattern, it performs one of several possible actions. It can request more measurements; it can alter operating parameters for either the affected subsystem or other subsystems using the affected subsystem; or it can recommend the replacement of the current implementation of the subcomponent with an alternative implementation. If devices, processors, and memory can be described as subcomponents, then the system could recommend which hardware upgrades would do the most good. Finally, if the system were able to replace major subcomponents at run-time (an ability we are separately researching), then the system could reconfigure itself automatically on the fly, perhaps even downloading new component implementations from vendor-provided servers.
5. Research Questions
We expect that the primary research question that we will have to answer for self-tuning systems to become a reality is the representation issue: how does one represent a component's performance and use expectations, its tuning parameters, and its measurable aspects? Many simple schemes are possible, e.g., characterize usage as calls per minute, performance as microseconds per call, etc. In practice, these measures will have to be parameterized; e.g., performance depends on the exact inputs supplied. The challenge, then, will be coming up with a scheme rich enough to be useful yet simple enough to be manageable.
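A first cut at the representation problem, deliberately simple, might parameterize per-call cost by input size. The linear cost model and the numbers below are assumptions, meant only to show why a flat microseconds-per-call figure is not enough:

```python
# Illustrative parameterized expectation: per-call cost modeled as a
# fixed overhead plus a per-byte term, per operation.
def expected_usec(op, nbytes, models={"read": (50.0, 0.01), "open": (200.0, 0.0)}):
    fixed, per_byte = models[op]
    return fixed + per_byte * nbytes

# A flat usec-per-call figure cannot distinguish a 1 KB read from a
# 100 KB read; a parameterized model can.
small = expected_usec("read", 1_000)    # 50 + 10 usec
large = expected_usec("read", 100_000)  # 50 + 1000 usec
```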
The other main research question is how to limit the performance impact of the actual analysis. While we believe that examining the expected benefits of analysis before actually performing such analysis will be a key part of a successful strategy, we expect that more will be needed to determine when and how much analysis should be done. At worst, of course, a system can always just measure itself during the day and analyze the measurements at night; even such batched self-analysis is far better than none at all.