Interval-based Registration Cache for Zero-Copy Protocols

École normale supérieure de Lyon Fundamental computer science master, First year

Interval-based Registration Cache for Zero-Copy Protocols Cédric Augonnet Supervised by Loïc Prylli & Patrick Geoffray

M YRICOM Software Development Lab Oak Ridge, TN, USA June - September 2007

Acknowledgments

I would first like to thank all the people I met at Myricom, who made it possible to work in great conditions, especially Loïc and Patrick, who took some of their precious time to teach me more than I would ever have hoped for when coming here, and who gave me the support to work on very exciting material. Thanks as well to Brice, who also supported this work, largely inspired by his thesis. Finally, I am very grateful to Abby and Marty, who made this summer as enjoyable as possible.

Abstract

Zero-copy protocols are essential for large-message performance in high-speed networking, but they require costly memory registration mechanisms to allow the network interface to access user-space memory. Myrinet Express currently offers a basic registration cache to reduce the overhead of that memory registration, but keeping the cache consistent forces user-space to monitor every change to the virtual memory mapping, which we show to be unreliable. After integrating "VMA Spy 2" into the Linux kernel to perform this monitoring inside the kernel in a reliable way, we modify Myrinet Express to use it. We then show that the current registration cache implementation in Myrinet Express suffers from granularity weaknesses. We therefore add a new caching facility in the Myrinet Express driver, with a grain size that we show to be better adapted than DMA windows, and modify Myrinet Express to integrate it. Finally, we demonstrate how these two independent contributions improve registration cache techniques and, consequently, zero-copy protocols.

Keywords: zero-copy, memory pinning, registration cache, Linux kernel, Myrinet Express, high-speed networking, operating systems.

Contents

Acknowledgments
Abstract
Introduction
1 High-speed networking techniques
   1.1 OS bypass architecture
   1.2 Zero-copy communications
2 Memory registration
   2.1 Need for memory pinning
   2.2 Memory pinning cost
      2.2.1 Actual costs related to memory pinning
      2.2.2 Is it still worth using zero-copy?
   2.3 State of the art of memory registration handling
      2.3.1 The naive method: ignore the problem
      2.3.2 The do-it-yourself method: explicitly handle registration
      2.3.3 The costly approach: the solution of Quadrics
   2.4 The current solution of MX
      2.4.1 A communication grain-sized registration cache: DMA windows
      2.4.2 The lack of flexibility of DMA windows
   2.5 Our objectives
3 VMA Spy 2, toward a robust registration cache consistency in the Linux kernel
   3.1 Consistency issues with our registration cache
   3.2 Is that issue not already solved by Myrinet Express?
      3.2.1 Monitoring virtual memory with glibc hooks
      3.2.2 glibc hooks weaknesses
   3.3 First contribution: VMA Spy 2
      3.3.1 The VMA Spy 2 framework
      3.3.2 Implementing VMA Spy 2 in the Linux kernel
   3.4 Modifying Myrinet Express to take advantage of VMA Spy 2
      3.4.1 Registration cache design without VMA Spy 2
      3.4.2 Registration cache design using VMA Spy 2
4 MX PinCache, an aggressive optimisation of the registration cache
   4.1 Limitations of the previous registration cache design
      4.1.1 Granularity issues with DMA windows
      4.1.2 Preliminary zoom on the zero-copy low-level implementation in MX
   4.2 Second contribution: MX PinCache
      4.2.1 Overall concept
      4.2.2 Design of MX PinCache
   4.3 Modifying MX to take advantage of MX PinCache
      4.3.1 Should it replace the DMA windows cache model?
      4.3.2 Could the NIC use cached data directly?
      4.3.3 Making VMA Spy 2 and MX PinCache independent
5 Experimentation with both contributions
   5.1 Test environment
   5.2 Functionality and correctness: cache consistency under all circumstances
      5.2.1 No regression
      5.2.2 A wider range of uses
   5.3 Performance: speeding up Myrinet Express
      5.3.1 Expected results
      5.3.2 Observed performance
6 Conclusion
   6.1 What was done in this document
   6.2 What was not presented in this document
   6.3 Future work and perspectives
A Measurement results
B Linux memory management
   B.1 VMA and process address space
   B.2 Paging and virtual memory
C Notes for implementing VMA Spy 2 in the Linux kernel
D MX PinCache implementation
   D.1 The container data structure
      D.1.1 Interval manipulations
      D.1.2 Metapage handling
   D.2 MX PinCache in action
Bibliography

Introduction

With the perpetual evolution of technology, the demand for computational power never stops increasing, especially for simulation-oriented applications: the prospect of ever more powerful machines calls for ever larger data sets, while asking for faster and sharper results. In 1965, G. Moore claimed that transistor density, and thus computing power, should follow an exponential growth law [10]. This visionary prediction held for a very long time, until the late 90's, when uni-processors came so close to physical limits that the only way to keep up that pace was to adopt another strategy: parallel processing. Such a parallelism trend is not that recent, though. The first multiprocessor machines actually date back to the Burroughs B5500 in 1961. Later on, memory rapidly turned out to be slower than processors, leading to the introduction of Non Uniform Memory Architectures (NUMA) as early as the seventies. That way, architects could face what would be called the Memory Wall [14] in 1994. The next step in this highly hierarchical tendency was to link such nodes together into what are called clusters. While super-computers generally offer very high performance by adopting proprietary architectures and network designs, they often turn obsolete rapidly as technologies keep improving.

On the other hand, clusters are simply dozens to thousands of rather inexpensive nodes, made out of off-the-shelf machines, linked together by a high-speed network. Clusters thus offer a very good performance-price ratio while remaining very scalable. Even though super-computers still hold some very elitist niches of the High Performance Computing (HPC) market, clusters have made impressive computational power achievable for those who do not have unlimited economic resources. It is still interesting to note that high-end technologies tend to be implemented first in super-computers, before being exported to clusters and then to more mainstream markets. One key to cluster performance is indeed the high-speed network: usual networks are not adapted to HPC requirements. High-speed interconnects nowadays offer 10 times more bandwidth, with 10 Gb/s solutions, and latencies as low as a couple of microseconds, compared with the couple of tens of microseconds observed on a common gigabit network card. There are actually multiple high-speed interconnect technologies available nowadays, among which SCI from Dolphin ICS, Myrinet from Myricom, Quadrics, InfiniBand, etc.; that list is naturally non-exhaustive. Unless specified otherwise, we shall specifically consider the Myrinet technology in the following.


Chapter 1

High-speed networking techniques

Achieving high performance often requires particular methods that are not (yet) applied in usual networks. We present here two examples of such techniques: first, removing the kernel from critical paths, and second, the use of zero-copy in order to avoid expensive, useless copies. It is worth noting that many of those techniques are eventually adopted by mainstream technologies: for instance, the use of Direct Memory Access (DMA) is now widespread, even for low-end devices.

1.1 OS bypass architecture

On Unix systems, networking is often performed using the socket interface, and goes through the TCP and IP layers as shown on the left part of figure 1.1. Several issues arise with those kernel networking stacks. In order to issue any message, data have to flow through numerous layers, dramatically deteriorating the latency (the time from the source sending a packet to the destination receiving it) to an order of tens of microseconds. Moreover, using a kernel stack implies issuing many system calls, which are also very costly: not only do they increase the latency even more, but issuing many system calls also means a significant system load, therefore impairing overall machine performance.

Dealing with HPC, having an overloaded system while using the network is a terrible waste of precious CPU cycles. More than that, the latency explosion resulting from the large number of layers to go through is unacceptable on critical paths. Thus, modern high-speed networking facilities try to reduce as much as possible the need to go into the kernel, and reduce critical paths to their strictly necessary minimum. To achieve this goal, Myrinet Express, like other similar libraries, makes use of a technique called OS bypass, as shown on the right part of figure 1.1. The idea is to allow the application, or a possible underlying communication runtime system, to access the Network Interface Card (NIC) directly, without any support from the kernel on those critical paths. U-Net was one of the first implementations of this now widespread method [12].

[Figure 1.1: Kernel Stack vs. OS Bypass — left: application, socket interface, UDP/TCP, IP, Ethernet and NIC driver in the kernel; right: application or middleware (MPI, ...) and communication library talking directly to the NIC firmware, with the kernel driver only involved at initialization.]

However, there are some paths that actually have to go through the kernel and for which bypassing is not an option. For instance, figure 1.1 shows that initialization is done through the kernel, for example to set up all the facilities that make it possible to defer most of the work to user-space. Therefore, in order to use devices such as Myrinet cards, there are three main components to set up: a firmware that runs on the NIC processor (known as the Lanai on Myrinet cards), a driver that runs inside the host kernel, and a user-space library that may interact with

both the driver and the firmware. It is otherwise worth noting that while OS bypass used to be fundamental for performance, the cost of a system call can now drop to about 100 ns on modern architectures using recent OS techniques such as fast system calls on Linux. One could therefore question whether OS bypass has turned obsolete.

1.2 Zero-copy communications

Another major technique for high-speed networking is the use of zero-copy protocols. In usual networking stacks, before being actually sent or received, a message is copied several times from buffer to buffer. As the number of layers may be quite large, the number of intermediate buffers grows as well. For large messages, this means the host CPU may be continuously copying the same piece of data throughout memory (such explicit copies by the host are usually referred to as Programmed IO, or PIO).

A first improvement is to reduce the number of those buffers, so as not to waste too many CPU cycles repeatedly doing the same copies. C. Dalton et al. thus proposed a single-copy protocol stack for TCP/IP [4].

However, high-speed network cards such as Myrinet ones feature not only a processor but also a DMA engine. With this in mind, a much more aggressive optimization is possible: we can simply avoid having the host CPU do any copy at all. Such a protocol is called zero-copy.

Instead of having the host processor explicitly copy data to the NIC, the main concept is to have the host processor tell the NIC where to find the data, so that the NIC's DMA engine can perform the copy by itself. This releases the host processor from a large burden, especially for large messages, where continuously copying data would severely hurt host performance.

[Figure 1.2: Zero-copy protocol — the user buffer in host memory is fetched by the NIC's DMA engine after an address translation, without any intermediate copy by the kernel.]

The whole challenge is to give the DMA engine sufficient information to grab the data from the host directly. Since the host must not perform any copy, the buffer to be transferred is described in the virtual address space. Unfortunately, Myrinet cards, like most others, cannot use those virtual addresses, but work in the physical address space. Prior to sending a request to the NIC, the library must therefore first obtain a description of the buffer location translated into physical addresses.

As we shall see later on, that translation work does not come for free. For small messages, explicitly having the host processor move the data is less costly than using a zero-copy protocol, which ends up being counter-productive in such a situation.

Historically, this technique was only used in HPC, but it turns out to be another method that is being adopted in more traditional networking stacks. It is now quite common on the send side with TCP/IP, and RDMA, which we shall discuss in section 2.3.2, is more or less an implementation of zero-copy on the receive side.

From now on, we will concentrate on improving zero-copy protocols by reducing as much as possible the overhead they introduce. Even though some vendors see zero-copy as some kind of generic solution, we will now see that it raises many challenging issues that must be addressed efficiently. One should actually remark that many of these drawbacks are either not properly solved yet, or simply hidden by biased vendor test suites, as shown by P. Geoffray [5].

Chapter 2

Memory registration

As section 1.2 suggested, zero-copy protocols are critically important for large messages. In spite of the idea that they completely release the host from any computational burden, they actually cause a rather unexpected overhead which, if not addressed properly, can make zero-copy even more expensive for the host than PIO, while the introduction of zero-copy was motivated precisely by reducing that overhead on the host.

2.1 Need for Memory Pinning

For the sake of simplicity, let us consider a process requesting the sending of a message coinciding with a page at address v in virtual memory (see appendix B for more details on virtual memory in the Linux kernel). As suggested previously, our ultimate goal is to supply the NIC DMA engine with enough information to fetch that memory page from host memory; figure 2.1 shows how this can be achieved.

[Figure 2.1: Zero-copy example — the process passes the virtual address v to the driver (1), the driver translates it into a physical address p (2) and gives p to the NIC (3), whose DMA engine fetches the page and sends it on the network (4).]

1. The user-space application supplies the virtual address v to the driver.

2. The driver translates v into a physical address p using the Linux paging facilities.

3. The driver gives p to the NIC.

4. The DMA engine on the NIC directly copies data from the user-space buffer onto the network without any intermediate buffering.

Unfortunately, things are not always that smooth: by the time the DMA engine has finished step 4, the mapping established at step 2 might have changed. There are two classes of events that may cause this:

• Programming errors: a user could, for instance, request the send of a message on a buffer B and perform a munmap system call on B. If the system call completes before the DMA engine ends its copy, the previous mapping does not make sense anymore. This being an error from the user, we must only make sure the system is not compromised, even though the user process might get undefined behaviour.

• Page swapping: if memory resources come to a state where the kernel needs to reclaim some memory from processes, the physical address p may become invalid, since v might refer to a page swapped out between steps 2 and 4. This being perfectly normal behaviour, it has to be addressed so that the NIC does not perform DMA on outdated physical addresses.

Let us consider the swapping issue more thoroughly. It is necessary that the transferred page not be swapped out before the DMA transfer ends. As shown in detail by M. Welsh et al. [13], solving the swapping issue is usually possible by

pinning memory while performing the memory translation (stage 2 of figure 2.1). If memory is kept pinned until the DMA transfer is done, page swapping is not an issue anymore. Let us also remark that if the process requests the transfer of a page currently swapped out, pinning first brings that page back into memory. As a result, to perform zero-copy, we must first pin down all the memory involved (in the following, we will also refer to this pinning as memory registration). Section 2.2 will show that this leads to an important overhead affecting zero-copy protocol performance. We previously mentioned that programming errors may lead to a situation where a translation is no longer valid. Since our only concern there is to make sure the system is not compromised, it is worth noting that pinning a virtual page also ensures that, as long as anyone refers to that page, the physical page will not be reused. Therefore, if there is a pending DMA access, the pinned physical page remains unchanged, so that memory consistency is not at stake from the kernel point of view. The user-space application could still observe undefined behaviour, which is acceptable since this happens in the context of a programming error.
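As a rough illustration, the registration step can be sketched as follows, assuming the 2.6-era kernel interface (get_user_pages() and page_to_phys()) that was current at the time of this work; this is a minimal sketch of the general mechanism, not the actual MX driver code.

```c
#include <linux/mm.h>
#include <linux/sched.h>
#include <asm/io.h>

/*
 * Pin the user pages backing [uaddr, uaddr + nr_pages * PAGE_SIZE) and
 * record their physical addresses. Error handling, alignment checks and
 * unpinning on failure are omitted for brevity.
 */
static int pin_and_translate(unsigned long uaddr, int nr_pages,
                             struct page **pages, unsigned long *phys)
{
    int i, pinned;

    down_read(&current->mm->mmap_sem);
    /* Each physical page is pinned one by one, hence the linear cost. */
    pinned = get_user_pages(current, current->mm, uaddr, nr_pages,
                            1 /* write */, 0 /* force */, pages, NULL);
    up_read(&current->mm->mmap_sem);

    if (pinned < nr_pages)
        return -EFAULT;

    /* These translations are what will later be handed to the NIC. */
    for (i = 0; i < nr_pages; i++)
        phys[i] = page_to_phys(pages[i]);

    return 0;
}
```

Unregistration simply releases every pinned page (with put_page()) once the NIC has acknowledged the transfer.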

2.2 Memory pinning cost

As suggested previously, memory pinning is not only required, but also rather expensive and influential on the performance of zero-copy protocols. In this section, we look at the actual costs induced by memory pinning.

2.2.1 Actual costs related to memory pinning

In the previous section, we suggested that taking memory pinning into account or not does make a difference to the overall performance. Let us consider the cost of both pinning and unpinning memory. To get the results of figure A.1 (appendix A), we measured the pinning and unpinning of a buffer through the MX interface (by calling the mx_{,un}register functions). The experiment was performed on a dual-processor 2.2 GHz AMD Opteron machine with 4 GB of RAM running MX 1.2.1 on a 2.6.17 Linux kernel. Each measurement was repeated 10000 times, and there was sufficient RAM so that no swapping occurred during the experiment. Even if those results are highly machine-dependent, the overall conclusion we can draw is that the cost of memory pinning and unpinning depends linearly on the number of pages to be handled, plus an initial cost. Indeed, every single physical page must be pinned one by one, hence the linear shape; moreover, there is an additional cost caused by the system call itself. This explains why some vendors try to hide that issue, and why the problem must be addressed carefully.

2.2.2 Is it still worth using zero-copy?

Let us take a critical view of zero-copy in such conditions: should we still use it even though it might end up doing the opposite of its initial aim, i.e. increasing CPU work on the host? A. Raoul et al. [2] compare PIO and DMA performance regardless of a possible registration overhead. According to them, while DMA leads to very large asynchronous bursts on the bus once initiated, PIO involves many small, usually word-sized transactions on the bus, which is much less efficient. PIO has no initial cost but rather limited bandwidth, since the host CPU has to write every single word on the bus one by one: not only is the CPU kept busy, but the procedure is highly inefficient as it performs a synchronous bus access per word.

In addition, rather than using PIO, it is common for networking stacks to issue a single copy followed by a DMA transfer. When comparing zero-copy and one-copy protocols, we have on the one hand the cost of registration, and on the other hand the cost of explicitly copying pages. Figure A.2 (appendix A) shows that, once again, for packets larger than a given threshold, zero-copy with registration remains more interesting than one-copy. On the set of benchmarked platforms, that threshold is always smaller than 32 KB.

2.3 State of the art of memory registration handling

As we have seen in section 2.2, one must take care to properly handle memory registration. If no attention is paid, a zero-copy protocol is likely to be even more expensive for the host CPU than a simple PIO, which is the exact opposite of our initial goal. As high-speed networking does rely on zero-copy for large-message performance, most ven-

the following, we will also mention this pinning as memory registration

7

dors had to find a work-around for that issue. We therefore take a look at some other high-speed interconnect solutions among those available on the market.

2.3.1 The naive method: ignore the problem

As often in such a situation, the first approach is to simply ignore the problem. There is indeed a straightforward solution that does allow zero-copy on a user-space buffer, regardless of the pinning cost issue. Let us have a careful look at it, since we will then optimize that procedure.

1. Request the kernel to pin down the user buffer.

2. Store the physical addresses obtained while performing the memory registration.

3. Give those physical addresses to the NIC.

4. The sending NIC then performs a DMA read on each of those physical pages to send them on the network. Similarly, the receiving NIC DMA-writes the incoming network data onto the physical pages it was given.

5. Once the transfer is finished, the registered memory is unpinned, since it does not need to be protected anymore.

With such an approach, every single message is first registered and then unregistered, which mostly affects latency. Unregistration also hurts performance, as it may add some delay between the arrival of a message and its acknowledgement to the receiving process.

2.3.2 The do-it-yourself method: explicitly handle registration

In order to get rid of the memory registration overhead, it is possible to offload it from the critical path. There is nowadays some hype around Remote Direct Memory Access (RDMA) protocols, which all promise great performance using a protocol based on zero-copy: every node explicitly manages some registered buffers in which any other node may perform a remote access by means of a DMA transfer. Actually, RDMA is mostly a way to tell the user to handle memory registration by himself when declaring those remotely accessible buffers.

This may lead to very efficient memory registration handling, but it requires some work from the programmer and also some understanding of the underlying issues. It can indeed be efficient when great care is taken over those registrations (or, more likely, when some communication runtime system such as MPICH handles these issues in between), but it is not the duty of programmers to handle difficulties introduced by low-level communication protocol designers. However, MPI does not take memory registration into account: while its semantics are based on a communication grain size, RDMA would require a page-sized or even buffer-oriented granularity. Therefore, letting the upper communication layers take care of the memory registration is not always easy, and the fact that MPI does not match the semantics of RDMA is a very compelling argument against its generalisation.

2.3.3 The costly approach: the solution of Quadrics

In terms of high-speed networking, Quadrics often designs very efficient cards. But however outstanding their performance, their cards end up extremely expensive, as these impressive numbers do come at a price: noticeably more on-card high-speed SRAM. 10G NICs from Quadrics embed 64MB and cost $3000, while Myrinet 10G solutions only have 2MB of SRAM but cost around $700. The Elan driver not only needs a lot of on-card memory, it also requires patching the Linux kernel. Elan actually implements a very interesting work-around by making those memory registrations useless. One must remember that the reason for those registrations is that, by the time the translated physical addresses are actually used by the NIC, the translations may have become outdated. By means of a patch to the Linux kernel, Elan is notified of any change to the memory mapping, and the large amount of SRAM is partially devoted to maintaining a subset of the process page tables. Using such a trick, Quadrics NICs can implement an MMU, so the host does not need to supply physical addresses: virtual addresses can be translated directly using this MMU. This strategy thus completely dodges our registration problem, as there is no need to keep up-to-date memory translations anymore. Unfortunately, this approach does not come for free, as one needs a sufficient amount of memory to maintain a page-table copy. And keeping it synchronized with the actual page table does imply having to

The costly approach : the solution of Q UADRICS

more likely when some communication runtime system as MPICH handles these issues in between


watch carefully every single change of the memory mapping at a page-sized level, thus modifying the core of the Linux memory sub-system [1]. Finally, one may argue that while it saves registration overhead, it makes Translation Look-aside Buffer invalidation more expensive. Once again, we have an example of how performance often involves trade-offs.

2.4 The current solution of MX

As Myrinet cards do not have as much memory as Quadrics ones, the solution presented in section 2.3.3 is not directly applicable. Let us look at how MX deals with memory registration.

2.4.1 A communication grain-sized registration cache: DMA windows

As often, a solution may come from caching techniques. MX does indeed introduce the notion of DMA windows. It is a rather simple variation of what we presented in section 2.3.1, called the registration cache, which H. Tezuka et al. also implemented on Myrinet [11].

• When a buffer is used for the first time, it is associated with a DMA window: this data structure records the buffer's beginning and its length in virtual memory.

• The whole buffer is pinned down. The address translations are kept in another buffer associated with that DMA window.

• The registered data are transmitted on the wire as explained before.

• When the transfer is done, the DMA window is kept as is; nothing else is done for now.

Now let us see how this can improve our protocol. According to the common locality principle, it is rather likely that this very same buffer will be used once more:

• When a communication is done on a user-space buffer, some checks are first performed in order to figure out whether it corresponds to some existing DMA window.

• If not, then we are dealing with a cache miss: we use the former approach and create a new DMA window, possibly invalidating some if there are already too many windows cached.

• If some DMA window does correspond exactly to the current buffer, then we can short-circuit the naive method: since the DMA window was kept as is after its previous use, the buffer is still pinned and there is no need to perform the address translation again. The NIC directly fetches the previously stored translations.

As suggested, it may sometimes be necessary to get rid of cached data so as not to waste too much memory on an over-aged, useless cache. When this happens, the corresponding buffer is unregistered and the DMA window is released. More details about the implementation of such a registration cache are given in section 3.4.1.

2.4.2 The lack of flexibility of DMA windows

In many cases, this very simple caching scheme allows important performance improvements, but we shall see in section 4.1.1 that it suffers from being too rigid.

2.5 Our objectives

Now that we have exposed the current state of the art concerning memory registration, we will try to add two new contributions:

• In chapter 3, we will make the registration cache much more robust by managing its consistency from the kernel. To that end, we shall modify the core of the Linux kernel, and then integrate this into Myrinet Express.

• In chapter 4, we shall also address some performance issues by modifying the caching scheme to apply a more aggressive approach that should increase performance by raising the cache hit ratio and reducing the average cost of a cache miss. This should be especially well adapted to real-life applications.

Moreover, those two improvements are completely independent, so that each of them may be used separately: the first method is more experimental, as it implies patching the kernel, while the second one involves nothing but a driver upgrade. Still, it should be possible to get either the improved robustness, or the improved performance, or both.

Chapter 3

VMA S PY 2, toward a robust registration cache consistency in the L INUX kernel • As C1 is the first communication, a new DMA window is created, and B is registered before C1 actually occurs.

In the following chapter, we will try to demonstrate that the registration cache we presented in section 2.4.1 introduces some new issues that must be addressed to maintain our cache consistent. To that regard, we will first propose a new facility for the L INUX kernel named VMA S PY 2. We will then implement it, and modify MX so that it should use VMA S PY 2.

3.1

• The MX user-space library has no clue of the remapping of B. • When C2 happens, the library founds back the very same buffer as in C1 , thus assumes a cache hit. The NIC is therefore given the same physical addresses as for C1 . Since memory registration fixed the previous pages in physical memory, the NIC uses out-dated pages, with File1 content instead of File2.

Consistency issues with our registration cache

Unfortunately, the protocol we studied in section 2.4.1 still leaves issues. We indeed claimed that the changes in memory mapping are either explained by swapping or by programming mistakes. If memory is kept pinned only during the communication, then any memory mapping change is an abnormal behaviour. But let us consider a situation that may happen if we use our registration cache :

We just demonstrated that if no care is taken, even with perfectly well behaving applications, our registration cache might lead to serious data corruption. There are actually various events that may leads to such cache inconsistencies, for instance : • Explicit calls to mmap or munmap.

1. File1 is mapped in user-space thus creating a buffer B.

• Calls to malloc and free could induce the use of a brk system call which may affect the cache.

2. Some communication C1 is performed on B.

• When writing data, one may break the copyon-write attribute of a physical page, so that the cache refers to the previous page while the user-space could use its tampered copy.

3. File1 is unmapped, and File2 is then mapped at the same location. B thus now contains File2 data. 4. Some communication C2 is performed on B. All this is a perfectly legal behaviour. But let us see how our cache performs :

So in all these rather common situations, the library has to manage to keep its registration cache consistent. 10

3.2

Is that issue not already 3.3 First contribution : VMA solved by M YRINET E XS PY 2 PRESS ? As we saw, maintaining our registration cache con-

sistent is not that easy. The current approach As registration cache is already used in MX, there is to monitor each and every potentially harmful has to be some way the current implementation GL IBC call, but it was found to fail in some situations where the use of registration cache would works around those consistency issues. As we noted, all these changes responsible for therefore be impossible. In that section, we will memory mapping changes 1 , and thus for making present a kernel facility that watches all these cache inconsistent are well known. Among them are changes so that the user-space is not responsible the calls to mmap, munmap, malloc, free, sbrk, for that monitoring, thus making our cache much more robust and reliable, regardless of user applimremap, memalign, realloc and brk. cations behaviour.

3.2.1

3.3.1 The VMA S PY 2 framework Monitoring virtual memory with Up to now, MX was making its best to track glibc hooks

changes happening in the process address space. To do so, the user-space had to make use of very Luckily, the GL IBC, where the user side of those unreliable techniques, affecting the user-space libmethods is usually implemented, offers some fa- erty of programming, and the robustness of the cility for detecting such calls. It is possible to set overall system. up some hooks on those GL IBC functions. For instance, when the application calls malloc, instead of actually performing the call, a special hook function is executed. MX thus manages to create wrappers around all these functions.

Maintaining cache consistency directly from the kernel

All these efforts were done to detect mechanisms In MX, when a munmap occurs, the hook checks which the kernel was intended to hide. It is thereif the corresponding area is overlapping any exist- fore rather natural to try to keep our cache consising DMA windows. If so, these windows are invali- tent by working directly inside the kernel. Thus indated, and the corresponding memory is unpinned. stead of trying to figure out all the user-space beThen when a new communication matches one of haviours that could affect the memory mapping, these windows, we do not reuse an invalid map- we will simply add hooks where those mechaping : a new DMA window is created, and the cor- nisms are triggered. If from the user-space level there are a lot of responding memory is registered again in a cache events that could change the mapping, the numthat remains consistent. ber of mechanisms used by the kernel to implement those changes is very limited. So there is a very little number of hooks to add in the kernel code, which is a clear benefit, compared to putting 3.2.2 glibc hooks weaknesses wrappers around dozens of GL IBC functions. There are actually numerous situations in which glibc hooks cannot help.

Related works : toward a reasonable granularity

The most common problem with registration In section 2.3.3, we said that Q UADRICS adds cache is that if the application was statically com- hooks in the core or the L INUX kernel to detect piled without those hooks, the cache will fail even those situations [1]. More precisely, they track any though the proper environment variable is set. TLB changes, which means that if a n pages buffer Moreover, there are some applications that do is unmapped, n hooks will be executed in a row, their own management of the heap, calling their possibly meaning n successive PIO to the NIC. own proprietary malloc instead of using GL IBC. Our situation is quite different. E LAN has to In such situation, the hooks will fail as well. follow changes with a page-sized granularity since 1 throughout

this document, we will assume that access right modifications and copy-on-write breaking account for memory mapping modification

11

Q UADRICS NICs directly work with virtual addresses, so that unmapping n pages might involve n completely unrelated physical pages, which thus have to be handled one by one. But on M YRINET, the NIC works on physical addresses, so that all our registration cache concerns yet untranslated virtual addresses. In this context, we can have a much less invasive technique : instead of reporting any change to the driver at page level, this is done directly by telling the driver that some range of virtual addresses was tampered. Contrary to physical space, a contiguous buffer in virtual memory can be handled at once. Our work will be done at a much higher level : directly on VMAs2 . Provided that VMA changes are quite rare events and that the total number of VMAs is usually limited to several order of magnitudes less than the number of corresponding pages, we drastically reduce the number of hooks called, compared with some page-sized granularity. With regards to all these remarks, it was found that it could be interesting to reuse a very similar work done by B. G OGLIN named VMA SPY [8]. Noting that W YCKOFF AND W U took the very same approach as well [15], even though their implementation is not generic and may lead to race conditions.

types on a given VMA using another data structure named a spy which efficiently indexes all spy types3 . This framework thus allows to put multiple hooks before any change in the memory mapping : notify_vma_spy_unmap(vma, start, end) Will call the unmap methods of every spy types put on the VMA as soon as the latter is totally or partially unmapped. ... to VMA S PY 2 Still, even though VMA SPY was offering an interesting approach to address the very same issue as ours, it could be somehow be improved. Provided L INUX memory sub-system, all modifications to the mapping are eventually processed by modifying some VMA. Since they are more or less nothing but simple descriptors for virtual memory ranges, there are few methods that may be applied to VMAs : • It can be created. • It can be split up into 2 successive VMA. • Several consecutive VMAs can be merged into a single one.

From VMA SPY ...

• A VMA can be deleted. Initially, VMA SPY was developed in order to allow ORFS (Optimized Remote File System) to And all changes can be expressed as a combimake use of registration cache on M YRINET net- nation of such operations. Let us for instance have works. An ORFS client is indeed residing inside a look at a simple partial unmapping situation on the kernel, so that there must be a mechanism to figure 3.1 : keep the cache consistent directly in the kernel as shown by B. G OGLIN ET AL . [6, 8]. A B C D The concept behind VMA SPY is rather simple : munmap it introduces a structure named spy type : struct vma_spy_type { void (*unmap) (...); void (*fork) (...); void *data; struct list_head vma_list; spinlock_t vma_lock; };

Figure 3.1: Partial unmapping with VMA SPY

In order to unmap [B, C[ from the VMA covering [A, D[, the L INUX kernel goes through several steps :

For instance, if such a spy type is put on a VMA, when some memory belonging to that VMA is unmapped, the unmap method is first called, possibly handling cache invalidation in our context, and then the unmap system call is actually performed. VMA SPY then makes possible to maintain multiple spy 2 see

appendix B for more details on V IRTUAL M EMORY A REAS details about those data structures are given in annex C

3 more

12

1. Split [A, D[ into [A, B[ and [B, D[ thus creating a new VMA [B, D[. 2. Split [B, D[ into [B, C[ and [C, D[. 3. Delete VMA [B, C[.

But with the former VMA SPY, splitting [A, D[ at B would be result into calling the unmap method on all the [B, D[ even before that new VMA is created. Now considering that [B, C[ could be a single page while [C, D[ possibly covers 1 gigabyte, this very simplistic approach spoiled a large amount of cache for no valid reason. Similarly, VMA SPY did lack some support for VMA merges. For the sake of simplicity, when 2 consecutive VMA were merged, VMA SPY would just simulate their being unmapped, thus invalidating both memory areas while no mapping was changed. Making VMA S PY 2 a generic scheme Even though both these situation are fairly uncommon, VMA S PY 2 should address this more carefully to avoid useless cache invalidation. This would also make VMA S PY 2 much more generic than VMA SPY as there is some hidden hazard with not handling this properly . Considering cache consistency protocols, it may be alright to simulate the unmapping of [C, D[. Since when those addresses are to be reused, a cache miss will occur, and the protocol will handle that case as if there had never been any spy on that area. But if VMA SPY was intended to report each and every changes to the mapping, the former implementation would fail. Indeed when the new VMA was created, no spy was put on it so that if it is modified, no change will be reported back. In order to make VMA S PY 2 a generic facility, we should assume that there could be multiple drivers using spies on the same VMA. Even though it was already possible to have multiple spy types on a given VMA, it is difficult to figure out which driver is responsible for each of the types, so we added a spyer field containing a magic number standing as an identifier for the corresponding driver. That way, a driver should be able to determine whether or not he already put a spy on some VMA.

3.3.2

Implementing VMA S PY 2 in the L INUX kernel

Even though this implementation was one of the most important contribution of our work, we decided to defer its description to an annex. So in annex C, we will show how VMA S PY was adapted to our requirements, making it a more generic framework. There we will show that we had to change VMA S PY 2 design into a stateful one, to avoid wasting our cache in some situ13

ations. Moreover we will show how merges and splits were properly supported.
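Before looking at the MX-side changes, here is a rough sketch of how a driver could attach a spy to the VMAs it cares about, using the vma_spy_type structure shown in section 3.3.1. Only that structure comes from the text above: the registration helper vma_spy_attach(), the callback signature and mx_invalidate_windows() are assumptions for illustration, not the real VMA Spy 2 or MX interface (which is detailed in annex C).

```c
#include <linux/mm.h>

#define MX_SPY_MAGIC 0x4d58  /* illustrative "spyer" identifier for MX */

/* Assumed callback shape: invoked just before [start, end) of a spied
 * VMA is unmapped, so the driver can drop overlapping cached windows. */
static void mx_spy_unmap(struct vm_area_struct *vma,
                         unsigned long start, unsigned long end)
{
    mx_invalidate_windows(start, end);   /* hypothetical MX helper */
}

static struct vma_spy_type mx_spy_type = {
    .unmap = mx_spy_unmap,
    /* .fork would be handled along the same lines */
};

/* Put a spy on every VMA overlapping a communication on [start, end);
 * the helper is assumed to skip VMAs MX already spies on (same magic). */
static void mx_spy_range(struct mm_struct *mm,
                         unsigned long start, unsigned long end)
{
    struct vm_area_struct *vma;

    for (vma = find_vma(mm, start); vma && vma->vm_start < end;
         vma = vma->vm_next)
        vma_spy_attach(vma, &mx_spy_type, MX_SPY_MAGIC);  /* assumed helper */
}
```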

3.4

Modifying M YRINET E XPRESS to take advantage of VMA S PY 2

Now that we implemented some facility to track changes in the memory mapping, we will use it to maintain our registration cache consistent.

3.4.1

Registration cache design without VMA S PY 2

But before showing how VMA S PY 2 can be used by MX, let us look how it performs without it. Most work is performed in user-space, so the registration cache is managed in the user-space library, not in the kernel driver. Using the cache for communications This cache is actually nothing but an array, for which every entry contains the following information : • Is the item valid (in use). • Where does the DMA window starts and ends. • What data should be passed to the NIC so that it can perform the corresponding communication. As described in section 2.4.1, when the application wants to use some buffer B to send or to receive data, it first checks if B corresponds to an active DMA window. This just means we go through the array looking for some entry which start and end match those of B. If any is found, we have a cache hit. Else, a new DMA window is created. Some free item of the array is chosen, the start and the end fields are filled, and it is set as a valid entry of the array. Keeping the cache consistent using glibc hooks Now let us consider that the application performs a call to free on a buffer B2 . Before the free is actually performed, the hook associated with the GL IBC free function, free_hook does the following : for each valid element of the array, if either the start or the end fields of the entry are in B2 , the item is invalidated.
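The user-space structure just described can be pictured roughly as follows; field and function names are illustrative, not the actual MX identifiers.

```c
#include <stdbool.h>
#include <stdint.h>

#define MX_MAX_WINDOWS 64          /* illustrative array size */

struct dma_window {
    bool      valid;               /* is the entry in use? */
    uintptr_t start, end;          /* bounds of the buffer in virtual memory */
    void     *nic_data;            /* whatever must be passed to the NIC */
};

static struct dma_window window_cache[MX_MAX_WINDOWS];

/* Cache lookup: a hit only if some window matches the buffer exactly. */
static struct dma_window *window_lookup(uintptr_t start, uintptr_t end)
{
    for (int i = 0; i < MX_MAX_WINDOWS; i++)
        if (window_cache[i].valid &&
            window_cache[i].start == start && window_cache[i].end == end)
            return &window_cache[i];
    return NULL;  /* miss: register the buffer and fill a free entry */
}

/* glibc-hook side: invalidate every window touching the freed buffer. */
static void invalidate_overlapping(uintptr_t start, uintptr_t end)
{
    for (int i = 0; i < MX_MAX_WINDOWS; i++)
        if (window_cache[i].valid &&
            window_cache[i].start < end && window_cache[i].end > start)
            window_cache[i].valid = false;  /* memory is unregistered as well */
}
```

A communication first calls window_lookup(); only on a miss does it pay the registration cost measured in section 2.2.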

That way, if some operation is likely to alter the To prevent those difficulties, we decided to opt memory mapping, any DMA window affected is in- for the second choice. In addition to the array manvalidated, simply by setting the valid field of the aged by the user-space library, we now maintain a corresponding array entry properly. new array in MX driver. This new array contains sufficient information so that VMA S PY 2 hooks 3.4.2 Registration cache design using should be able to determine which DMA windows should be invalidated and which should not. ThereVMA S PY 2 fore this new array actually consists in a subset of Let us see how this can be implemented with the previous array so that we only store : if the entry is valid or not, and the start and the end of the VMA SPY 2 : corresponding DMA window if any. So, when a new DMA window is created this Enabling VMA S PY 2 new driver-space array is filled with redundant inThe eventual goal is to replace glibc hooks, so we formation. must be able to detect when some memory region Then when a VMA S PY 2 hook is executed, it involved in a communication is tampered. The reads all the entries of that driver-space array, and simplest solution we found was to put those spies possibly invalidates those concerned by the mapduring the first communications : ping change that initiated the hook. • Consider some communication on memory range R.

Keeping the user-space informed of DMA window invalidation

• Find each VMA overlapping R.

But as we suggested, when VMA S PY 2 invali• For all these VMAs, if there is already a spy dates a window, it only modifies the corresponding with the magic number of MX do nothing, else entry in the driver-space array. So if some commuput a spy. nication occurs on such an invalidated window, it That way, when the mapping of one of these needs to somehow be informed that it is not valid VMAs overlapping R is modified, the correspond- anymore. ing spy method is called. But when such a modifiWe considered and implemented two apcation occurs on a VMA on which no communica- proaches : tion ever took place, then no hook is called at all, • When looking into the user-space array for thus preventing useless overhead. window matching our current communicaOur next step is to define what would those tion, if some candidate is found, an ioctl4 is spy methods do to maintain cache consistency when performed to explicitly check the validity of a change is detected on a VMA that carries a spy. the window. Validity bitmap While glibc hooks are executed in user-space, VMA SPY 2 hooks reside in kernel-space and are executed on behalf of the MX driver. This let two possibilities : • The VMA SPY 2 hooks could access the userspace array and update it directly. • The hooks can modify a new structure in the kernel. But this means the user-space array must be somehow kept up-to-date as well. The first approach may seem better, but it would mean that the driver would first have to map this user-space array in kernel-space. C HOKE AND W U [15] used that method, which may create race conditions if the user thread is already in use. 4 ioctl

• Using a lazy invalidation method. When a matching entry is found in the user-space array, it is assumed to be valid. The driver first checks that the corresponding entry in the driver-space array was not invalidated since. If not, we do have a cache hit. Else, the driver immediately returns an EAGAIN error to the library so that it figures out that the user-space array entry was not valid, invalidates it, and behaves as in the case of a cache miss. We then took into consideration the fact that it is very unlikely that a DMA window would ever be invalidated. While the first method required an extra system call for each communications, the second one would only require one in the rare situations where the window was invalidated in between. For that reason, we kept the second method.

is a system call allowing an application to communicate with a device

14

Chapter 4

MX P IN C ACHE, an aggressive optimisation of the registration cache Section 2.4.2 already suggested that DMA windows are somehow lacking flexibility. We will first analyze the weaknesses of current caching scheme. Then we will design a much more aggressive approach to address those issues. Eventually we will implement this new approach and integrate it into M YRINET E X PRESS.

reused. comm2 is even worse : we do know that the buffer is strictly included in DMA win 2. All its memory is pinned and the former translation could be reused partially. Last, comm3 shows an example of a buffer covering exactly two consecutive DMA windows. Once again, all memory is already safely pinned and translated, but no cached data will be used. Therefore, if MX might have a noticeable per4.1 Limitations of the previous formance improvement when using this caching scheme, there are many drawbacks that we should registration cache design address here. Ultimately, a respectable goal would In that section, we shall try to understand what be, not only to increase the cache hit ratio, but also DMA windows lack. Then we should introduce to reduce the cost of cache misses. This should some basic sketch of the way the NIC actually per- neither add too much overhead to our caching forms DMA in order to understand the require- scheme as there is never any use for a cache where even hits are more expensive than a straightforments for improving the caching scheme. ward solution.

4.1.1

Granularity issues with DMA 4.1.2 windows

Preliminary zoom on zero-copy low-level implementation in MX

It indeed appears that DMA windows are a little too rigid, and could be improved. Let us consider the In order to understand how the cache may be imcase of a process that would perform communica- proved, we must first understand how cached data tions on slightly different buffers : corresponding to DMA windows are handled at lowlevel. comm 1

comm 2 comm 3

DMA win 1

What is done when memory is registered

DMA win 2

Let us consider a simple situation : some userspace memory buffer B has to be sent or used for reception. B should overlap 8 virtual pages which starting address should be v0 , . . . , v7 respectively. Figure 4.1 shows three situations where MX For the sake of simplicity, we may also assume that will face a cache miss while we could have hoped a page should be large enough to contain exactly 4 some use of the cached data. comm1 is slightly addresses (in reality this should be 512 or 1024 adlarger than DMA win 1, the translation will not be dresses per page). Figure 4.1: Lacks of DMA windows

15

Since the NIC DMA engine needs the physical addresses of all those pages, MX maintains a list of those translated addresses into a somewhat usual structure1 : some directory page contains pointers to other pages, each of those secondary pages would then contain 4 physical addresses.

• Each entries of that page are page physical addresses. The DMA engine is given these addresses. It may then copy the content of these pages from memory to the network.

4.2 B v0

v1

v2

v3

v5

v4

v5

v6

v7

After making the registration cache much more robust in section 3.3, we will now work on making it more efficient by addressing some issues encountered with the DMA window based registration cache. After showing which issues could be solved, we shall introduce a new design for the registration cache facilities. This proposition should then be implemented into MX to validate our design. We will eventually look how this changes the way memory registration is performed.

MMU

Directory

Second contribution : MX P IN C ACHE

p5

p5

4.2.1

Figure 4.2: Filling DMA tables

Figure 4.2 shows how this works. For each virtual pages i, its virtual address vi is translated into a physical address pi using the Memory Management Unit (MMU). pi is then copied in the corresponding entry of the DMA table. In our example, we have 4 entries per page, so the sixth entry must be put in the second page. The address of that page is retrieved from the directory page, then the p5 physical address is put at the second position in that page. Thus once all virtual pages have been handled, the directory page points to a set of pages containing all the addresses of the physical pages involved in the communication. For brevity reasons, we will assume a single page may be sufficient for the directory.

Overall concept

We will now present the different guidelines for designing MX P IN C ACHE, with regard to our actual goals and trying to avoid what made other approaches somehow weak. Goals of MX P IN C ACHE

First let us note that provided the limited amount of time available, it would not have been possible to redesign all M YRINET E X PRESS from the library down to the firmware. Therefore, we had to find some approach which scope of changes would be quite limited. We should thus change MX transparently for the library and the firmware, so that we were only authorized to modify the driver. The major objection done to DMA windows were their lack of flexibility. Cache hit would only occur when matching the exact same buffers for How the NIC DMA engine handles DMA tables different communications. Let us consider a communication on a We now consider that the DMA table is filled and buffer covering virtual memory addresses range that the NIC can perform the corresponding com- [1000, 3000[. Performing a communication on munication, which is assumed to be a send opera- [1500, 2500[ afterward should not turn into a DMA tion : window cache hit as it would mean to change the • The host copies the directory page into the library behaviour, but the subsequent cache miss should be somehow less expensive than if no preNIC. vious communication had been performed. Sim• For each valid entries of the directory page, it ilarly, sending the [2000, 4000[ buffer should be fetches the corresponding page. more efficient as well. 1 this

is very similar to page tables

16

4.2.2 Design of MX PIN CACHE

We will now present how we addressed the limitations of the current registration cache, with regard to our previous goals.

Which granularity?

When designing a cache infrastructure, one should not only store data, it should also be possible to find out which data are cached and which are not. We already encountered a structure with almost the same constraints and the same goals: VMAs, which manage process memory with an interval model. Using VMAs, it is possible to determine whether or not a virtual address is associated with an actual memory area, which is exactly the same question as determining whether an address is in the cache. We will thus somehow mimic VMAs. Still, we do not need all of the VMA features; given a virtual address, the only required information is:

• Is that address in a registered area?

• If so, what are the cached translations?

For now, let us forget about the second point. According to our previous section, an interval-sized granularity is a straightforward way to answer the question "was that memory range already involved in a communication, so that cached information can be reused?". Such a granularity provides very efficient management, as it becomes possible to invalidate a given memory range rather than every page it contains; nor do we want to flush the whole cache corresponding to the heap as soon as a malloc is performed.

One of the actual challenges in designing MX PIN CACHE was that it should ultimately be used to supply a list of page addresses to the NIC. So not only is it interesting to manage the cache with an interval-sized granularity, it should also be possible to extract the data corresponding to a given page. It turns out that the current firmware granularity is not page-sized, but defined by the number of addresses fitting in a page: as shown on figure 4.2, the NIC can handle up to 1024 pages at once. In the following we will designate those pages containing lists of addresses as metapages.

So we will try to perform most maintenance tasks at the interval level, while using cached data at the metapage level.
Caching and invalidating the cache

When defining what is cached and what is not, the interval-based approach is really simple, as only two operations have to be implemented (a toy sketch of both is given at the end of this subsection):

• When a communication is performed on an unregistered area, add an interval for that area.

• When some memory range must be invalidated, remove any interval on that range.

[Figure 4.3: Caching and invalidating intervals]

As figure 4.3 suggests, there is no need to maintain overlapping intervals: two overlapping intervals would both signify that the area is registered, and such redundancy is useless here. As with VMAs, adding an interval just next to another one merges them. When invalidating some area, we do not completely suppress every overlapping interval, we only get rid of the concerned part of it. We may thus partially invalidate an interval instead of throwing our cache away without any reason.
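As an illustration only — this is not the MX driver code — the following toy sketch keeps a sorted array of non-overlapping [start, end[ intervals and implements the two operations above: registration merges overlapping or adjacent intervals, invalidation trims, splits or removes them. Bounds checks against MAX_INTERVALS are omitted for brevity, and all names are ours.

    #include <stddef.h>

    #define MAX_INTERVALS 128

    struct interval { unsigned long start, end; };

    static struct interval cache[MAX_INTERVALS];
    static size_t nr;

    /* Register [start, end[: absorb every overlapping or adjacent interval. */
    void cache_register(unsigned long start, unsigned long end)
    {
            size_t i = 0, j, k, swallowed;

            while (i < nr && cache[i].end < start)
                    i++;                            /* intervals entirely before */
            j = i;
            while (j < nr && cache[j].start <= end) {
                    if (cache[j].start < start) start = cache[j].start;
                    if (cache[j].end   > end)   end   = cache[j].end;
                    j++;                            /* intervals merged into the new one */
            }
            swallowed = j - i;
            if (swallowed == 0) {                   /* plain insertion at position i */
                    for (k = nr; k > i; k--)
                            cache[k] = cache[k - 1];
                    nr++;
            } else {                                /* reuse slot i, drop the others */
                    for (k = j; k < nr; k++)
                            cache[i + 1 + k - j] = cache[k];
                    nr -= swallowed - 1;
            }
            cache[i].start = start;
            cache[i].end   = end;
    }

    /* Invalidate [start, end[: trim, split or remove any overlapping interval. */
    void cache_invalidate(unsigned long start, unsigned long end)
    {
            size_t i = 0, k;

            while (i < nr) {
                    if (cache[i].end <= start || cache[i].start >= end) {
                            i++;                    /* no overlap, keep it */
                    } else if (cache[i].start < start && cache[i].end > end) {
                            for (k = nr; k > i + 1; k--)    /* split in two */
                                    cache[k] = cache[k - 1];
                            cache[i + 1].start = end;
                            cache[i + 1].end   = cache[i].end;
                            cache[i].end = start;
                            nr++;
                            return;
                    } else if (cache[i].start < start) {
                            cache[i].end = start;   /* trim the tail */
                            i++;
                    } else if (cache[i].end > end) {
                            cache[i].start = end;   /* trim the head */
                            i++;
                    } else {                        /* fully covered: remove it */
                            for (k = i + 1; k < nr; k++)
                                    cache[k - 1] = cache[k];
                            nr--;
                    }
            }
    }

The real implementation uses a doubly linked list and stores the cached translations alongside each interval, as detailed in annex D, but the merging and splitting logic follows the same principles.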

We shall now see how such a design can also store cached data.

Handling cached data

We just saw that it is possible to efficiently determine whether or not a virtual address lies in a registered area. We will now analyze how the corresponding cached data can be retrieved, so that we can give the NIC a correct DMA table as on figure 4.2.

Ideally, we would have directly stored the DMA tables as produced on figure 4.2, but this creates a problem because the metapages we mentioned have to be actual physical pages. Consider two communications, where the second buffer starts one page after the start of the first. In both cases, the first pointer of the first metapage would be the page containing the start address, which means that the addresses stored in the metapages would be offset by one between the first and the second communication. As a metapage has to match a physical page, the second communication cannot reuse the previous metapages. But if all stored metapages had the same offset, there would be no problem, as one could easily reuse or even insert cached data.

[Figure 4.4: Alignment constraint — without alignment, communication 1 stores pages (1 2 3 4)(5 6 7) while communication 2, shifted by one page, stores (2 3 4 5)(6 7): the metapage contents differ and cannot be shared.]

For instance, on figure 4.4, where each entry of the metapages contains the address of a page, pages 4 and 5 end up either on the same metapage or on different metapages depending on the communication. As we only want to store these cached addresses once, there has to be a way to enforce that the same set of pages is always indexed by a given metapage, regardless of the communication patterns.
Metapage space

So we decided that cached metapages should be stored with an additional alignment constraint: if there is room for exactly n addresses per page, the first entry of any stored metapage must point to a page whose index is a multiple of n. That way, we constructed a new partition of virtual memory into metapages, which we shall now call metapage space.

[Figure 4.5: Metapages respecting the alignment constraint — with metapages aligned on multiples of n, communication 1 stores (1 2 3)(4 5 6 7) and communication 2 stores (2 3)(4 5 6 7): the metapage (4 5 6 7) is shared.]

Let us have an example of how this helps in our situation, and suppose that:

• physical pages have a size S (usually 4KB);

• there is room for exactly n pointers on each physical page (usually either 512 or 1024).

Provided our previous work on intervals, we can determine that address v belongs to a registered page. Finding the physical address is then done by locating the proper metapage, whose index in metapage space is

    n_metapage = floor(v / (nS))

The physical address of the page containing v is then found in the k-th entry of that metapage, with

    k = floor(v / S) mod n

Since S and n are usually powers of 2, this indexing is extremely efficient. Moreover, the cached metapages and those given to the NIC in the DMA tables only differ by an offset. Constructing those tables thus requires at worst copying 2 partial cached metapages, which is far more efficient than re-issuing the complete pinning of all these pages one by one. Adding new address translations to the cache is really simple as well.
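Assuming S = 4KB and n = 1024 entries per metapage, the indexing above boils down to shifts and masks. The helper names below are ours, for illustration only:

    #include <stdint.h>

    #define PAGE_SHIFT    12        /* S = 4096 bytes */
    #define ENTRIES_SHIFT 10        /* n = 1024 addresses per metapage */

    /* Index, in metapage space, of the metapage covering virtual address v. */
    static inline uint64_t metapage_index(uint64_t v)
    {
            return v >> (PAGE_SHIFT + ENTRIES_SHIFT);            /* floor(v / (n*S)) */
    }

    /* Entry of that metapage holding the translation of the page containing v. */
    static inline unsigned int metapage_entry(uint64_t v)
    {
            return (v >> PAGE_SHIFT) & ((1u << ENTRIES_SHIFT) - 1);  /* floor(v/S) mod n */
    }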

Toward an efficient data structure

Our next challenge was to find an efficient data structure managing not only the intervals, which determine what is cached and what is not, but also our metapages, which contain the actual cached information. The implementation of such a data structure is detailed in annex D.

4.3 Modifying MX to take advantage of MX PIN CACHE

We will now see how this new facility can be used by the MX driver.

4.3.1 Should it replace the DMA windows cache model?

The first question is whether MX PIN CACHE should ultimately replace the former registration cache model completely, in which case the notion of DMA windows would be removed from MX. Even though this could be done, not only would it mean changing both the library and the firmware in addition to the driver, it would also hurt performance. Indeed, even if DMA windows suffer from the corner cases we addressed here, they still catch a lot of communication schemes where they help considerably. Thus, instead of just displacing the registration cache from the library to the driver, we maintain both in a multi-level registration cache:

• to keep a very efficient cache hit path;

• to avoid useless cache invalidations, thus mechanically raising the cache hit ratio;

• to make cache miss handling an order of magnitude more efficient than before.

Eventually, combining the DMA window registration cache with MX PIN CACHE should make it possible, for long-running applications, to register memory only once, and to only have cache hits or very inexpensive cache misses whose cost tends to be similar to that of cache hits.
4.3.2 Could the NIC use cached data directly?

We mentioned in section 4.2.2 that it would be very easy (and efficient) to rebuild DMA tables from the cache by issuing a few memory copies. Unlike the LINUX page cache, the cached data of MX PIN CACHE are not intended to be used directly by the NIC. Instead, we chose to explicitly keep reconstructing the DMA tables by means of rather cheap copies, compared with the former registration costs. This has three major advantages:

• It is perfectly transparent outside the driver, so that neither the firmware nor the library needs the slightest modification.

• Similarly, it solves any possible alignment issue, as the cached metapages may not exactly correspond to those given to the NIC.

• It prevents concurrency issues, as one may invalidate some cache entries without affecting on-going communications.

Therefore, while it could perhaps be an improvement to directly tell the firmware to use the cached metapages, by supplying the NIC with additional information such as an offset, we chose not to do so for now. Future work could change that behaviour, but it would mean applying many more changes to MYRINET EXPRESS. It also introduces really challenging issues: cache invalidation would no longer be as simple as invalidating an interval in a data structure, since the NIC would possibly have to be informed in case it was using those data. So for now we concede a little performance for the sake of simplicity and because of the limited amount of available time.
4.3.3 Making VMA Spy 2 and MX PIN CACHE independent

While VMA Spy 2 makes it possible to have a robust cache, MX PIN CACHE offers an efficient approach. These two notions should remain clearly distinct, and users should be able to choose which optimization they adopt. However, if MX PIN CACHE is used without the VMA Spy 2 hooks keeping the cache up to date, there must be some way to keep this cache coherent even though all the cache coherency is then managed from user-space with the support of the glibc hooks. We therefore adapted the corresponding hooks so that they now perform an extra system call to invalidate not only the user-space DMA windows but also the kernel side of our new caching scheme. Although this approach requires some additional system calls, the induced overhead is not a real problem, as such invalidations should not occur on a critical path. Moreover, MX PIN CACHE should allow for a significant performance gain, so this is a realistic trade-off.

Last but not least, since MX PIN CACHE is a simple modification of the driver, it is eligible for actual use by MYRINET EXPRESS, while VMA Spy 2 requires patching the LINUX kernel, which is somewhat problematic for most users, who do not want their production cluster to run experimental patch sets (note that QUADRICS users still have to patch their kernel).
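As a purely hypothetical illustration of this fallback path — the actual MX hook and ioctl interface differ and the names below are invented — a user-space unmap hook could notify the driver before performing the real system call:

    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Hypothetical ioctl command and argument; not the real MX interface. */
    #define MX_PINCACHE_INVALIDATE 0x4d58
    struct mx_invalidate_arg { unsigned long start, len; };

    extern int mx_device_fd;            /* assumed to be opened by the MX library */

    /* Wrapper installed by the registration-cache hooks in place of munmap(). */
    int hooked_munmap(void *addr, size_t len)
    {
            struct mx_invalidate_arg arg = { (unsigned long)addr, len };

            /* Invalidate the user-space DMA windows (not shown) and the
             * kernel-side MX PIN CACHE before the mapping actually changes. */
            ioctl(mx_device_fd, MX_PINCACHE_INVALIDATE, &arg);

            return syscall(SYS_munmap, addr, len);
    }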

Chapter 5

Experimentation with both contributions

Now that we have outlined these two independent modifications to the former registration cache design, let us analyze whether they are actual improvements. After studying how they help in terms of functionality, we analyze how performance is affected.

5.1 Test environment

Our evaluation is performed on some of the MYRICOM clusters, named fog, shower and rain. Fog has 2 INTEL XEON at 2.40GHz and 2 MYRINET 2000 network cards on each of its 38 nodes; each node has 1GB of RAM. Shower is a 4-node cluster with 4 DUAL-CORE AMD OPTERON per node, 8GB of RAM, and two MYRI-10G network cards. Rain is made of 18 nodes with 2 AMD OPTERON each; every node has both a MYRINET 2000 and a MYRI-10G network card, and 4GB of RAM. This way we can conduct our tests on both 32 and 64 bit architectures, as well as on various MYRINET cards. All these machines run LINUX versions 2.6.17 to 2.6.22.
5.2 Functionality and correctness: cache consistency under all circumstances

In this section, we focus on the effect of using VMA SPY 2 as described in section 3.3. We saw in section 3.2 that MX already managed to keep its cache consistent, so we must first make sure that catching memory mapping changes directly inside the kernel performs at least as well as the glibc hooks.

5.2.1 No regression

In order to make sure VMA Spy 2 was working properly, we first wrote a trivial device driver tracking any change in the memory mapping, together with a test-suite covering all possible situations, for instance all the configurations in which an unmapping or a mapping partially modifies a VMA or a whole set of VMAs. Once this module showed that all cases were properly handled, we tested the effect on MYRINET EXPRESS: to make sure all previous features were still valid, we wrote another test-suite where communications are performed on large buffers corresponding to areas involved in memory mapping changes. That is to say, we mapped and unmapped files, possibly partially, while performing communications on these areas, and all received data was checked to detect corruption. VMA Spy 2 handled all the situations previously handled by the glibc hooks, so we do not lose any functionality.
5.2.2 A wider range of uses

In section 3.2.2, we showed that the glibc hooks suffer from weaknesses linked to the fact that user-space applications may behave in various ways, for instance running an application statically compiled without registration cache support while activating it in MX later on. This issue is solved, since the execution of the kernel hooks does not depend on the way the program was compiled. Moreover, an application can implement its own interface to system calls: previously, if the glibc functions were not used, the glibc hooks could not be called. Now, if the application issues those calls itself by supplying the arguments and the system call number, those syscalls are caught as well.
5.3 Performance: speeding up MYRINET EXPRESS

We shall now see whether our technique to reduce the cost of cache misses is effective.

5.3.1 Expected results

Our MX PIN CACHE should have several effects on performance. First, in the case of a pure cache miss, that is to say on an area that was never registered before, we have not only the work of the previous implementation but also some extra work due to an initial memory copy for the cached version, plus various management overhead.

However, when dealing with a buffer that is already partially, or even fully, registered, the benefit could be important. At best, if we neglect all management overhead, cache effects and so on, a very optimistic evaluation predicts that if a ratio r of a buffer of size S is already registered, re-registering that part is as costly as a cache hit, while registering the rest is as costly as a cache miss. This gives:

    C_register(S, r) = C_hit(rS) + C_miss((1 − r)S)

where C_miss and C_hit are the costs of a pure cache miss and of a cache hit respectively. For large messages, these costs are almost linear, so that we should have at best:

    C_register(S, r) = r C_hit(S) + (1 − r) C_miss(S)

Before going any further, let us consider the case of a pure cache hit. Such a situation happens for instance when a buffer B1 is registered and we then register a buffer B2 such that B2 ⊆ B1. In that case, the only work to do is to copy the content of the metapages into a new DMA table. This is very efficient, since up to 1024 pages are handled at once using optimized memory copy routines, compared to re-pinning each and every page. Thus we may admit that C_miss ≫ C_hit.

Eventually, as the application keeps running for a while, most of its memory ends up registered, so that the ratio r usually tends to 1. Ultimately we should therefore end up with:

    C_register(S, r ≈ 1) ≈ C_hit(S) ≪ C_miss(S)

In this very simplistic and optimistic model, the registration cost thus boils down to the cost of simple cache hits for long-running applications. This optimization should especially help real applications, which are likely to eventually have registered a lot of memory.

5.3.2 Observed performance

The following experiment intends to show that we gain performance when registering a buffer of which a part is already cached. We measured the time needed to register a buffer depending on the ratio of cached data, considering 2MB buffers and comparing MX with our new cache on the one hand, and unmodified MX on the other hand.

The results of that experiment, performed on the shower machines, are shown on figure A.3 in appendix A. As expected, the more data is cached, the faster the registration is performed, and we obtain exactly the behaviour expected in section 5.3.1:

• If the buffer was not in the cache at all, some additional work is required to not only pin the buffer but also cache it. We measured an extra overhead of 10% for these 2MB buffers.

• If the buffer was fully cached, registering it again only takes 20% of the previous time. This means our cache ultimately makes it possible to register data up to 5 times faster on this machine.

• As the ratio increases, the performance of our cache gets better. On this platform, there is a substantial gain as soon as more than 20% of the buffer is already registered, improving as the ratio gets closer to 100%.

This experiment shows that it is possible to significantly improve registration performance with such an aggressive caching technique. Even though those numbers could be improved with further optimization work, one should remember that our cache competes against deeply optimized software.
Chapter 6

Conclusion

We will now wrap up this document by summarizing our work, pointing out some of its current shortcomings, and showing how it could still evolve.

6.1 What was done in this document

After studying the requirements of zero-copy and its shortcomings, we showed that memory registration was an issue. We introduced two main ideas to help improve this:

• Moving cache consistency management from user-space to the kernel, making the scheme more robust and less restrictive. This required modifying the core of the LINUX kernel, as well as both the MX library and the MX driver.

• Finding a proper granularity and adding a new level of registration cache in the kernel, so that avoiding redundant work reduces the cost of cache misses without affecting cache hits. We managed to have this change affect only the MX driver.

These techniques put together, or even taken separately, should help make zero-copy cheap for the host CPU, reaching the goal it was initially designed for.

6.2 What was not presented in this document

First, most of the actual technical difficulties were hidden in this document, so that we could concentrate on the ideas. Besides, in addition to this work on the registration cache in MYRINET EXPRESS, we worked on various unrelated tasks:

• To get familiar with MYRINET facilities, from the use of high-level tools down to very low-level matters such as PCI bus programming, we developed an Ethernet packet generator firmware for MYRI-10G cards. This was extremely instructive but not really appropriate for this document, so it was not presented here, even though it was a great way to start this internship and turned out to be actually needed.

• As this internship was also a great opportunity to discover kernel programming, we are now working on the support of the PAGE ATTRIBUTE TABLE feature of X86 processors in the LINUX kernel. This was also rather irrelevant for this document, but such work should allow a clean use of the write-combining cache policy, which is fundamental for PIO performance, as shown by R. A. F. Bhoedjang et al. [2].
6.3 Future work and perspectives

There are of course various things that could be improved throughout our work. Our goal with VMA SPY 2 was more a proof of concept than a production-ready patch to LINUX: numerous corner cases have not been addressed yet. For instance, we do not handle some memory hot-plug problems, and other equally unexpected issues. Fixing them would require a lot of work throughout the kernel, but it is nothing more than a technical issue.

The biggest issue not addressed here is certainly the efficient handling of memory deregistration: all the mechanisms are available for that, but it turns out to be extremely difficult to determine when to throw the cache away, so that it does not consume too much memory. An approach could be to garbage-collect our cache when the system starts lacking memory.

There are of course many other possible improvements, but besides the solving of technical issues and the need for better control of memory usage, this work opens the way to various future works. Zero-copy is not a new technique, but it still suffers from those issues. Nowadays, various people claim that RDMA should solve all problems; this work could be extended to those applications.

For the moment, this cache is typically useful for point-to-point communications, as it is defined in the MX endpoint structure, which corresponds to a session between two nodes. But all this memory caching information could be applied on a per-process basis, since all the mechanisms involved here operate on a process address space. Such a method could greatly help collective communications: say a buffer has to be sent to n hosts; instead of pinning the same buffer n times, we could pay a little more for the registration once, and make it cheap for the n − 1 remaining endpoints. Scatter operations could benefit from this as well in the case of overlapping buffers. This way, zero-copy used with the cache would definitely improve scalability for collective communications with large messages.

A major improvement would be to get rid of the need to copy data out of the cache. The firmware could be modified to use the cache directly as a global translation table. The current approach is to first copy the cached data to a buffer and then pass it to the NIC; with some work, we could remove this copy, so that registration would take almost constant time instead of linear time. But this would raise many hard problems.

As B. Goglin underlined [7], it would be interesting for the LINUX kernel to provide a generic facility similar to VMA SPY 2. But even though multiple actors of the HPC community could benefit from it, it is very unlikely that it would ever enter the mainline kernel, as it implies modifying the core of the memory sub-system while only helping a tiny part of the overall LINUX user base.

Following the example of QUADRICS, we could also simply get rid of the whole registration problem by keeping track of changes and informing the NIC before the actual changes are made. Likewise, the NIC could directly work with virtual addresses, using our cache to maintain a subset of the page tables on the host. Provided that we have a better granularity than QUADRICS, which handles one page at a time, VMA Spy 2 should make this possible, even though it would again be very hard work to completely get rid of memory registration.
Appendix A

Measurement results

[Figure A.1: Cost of memory registration — time in microseconds to pin and unpin memory as a function of the number of memory pages (log-log scale).]
Machine                        Bus            System Call   Memory Copy   PIO         Registration           Threshold
Dual P3 1 GHz                  PCI 64/66      340 ns        250 MB/s      88 MB/s     3.3 us + 0.26 us/pg    1.3 KB
Dual P4 2.4 GHz                PCI-X 64/133   460 ns        1.7 GB/s      213 MB/s    5.8 us + 0.28 us/pg    23.3 KB
Dual Opteron 2.2 GHz           PCI-E 8x       77 ns         2.1 GB/s      1100 MB/s   3.0 us + 0.23 us/pg    11.7 KB
Dual 2-core Opteron 2.8 GHz    PCI-E 8x       71 ns         2.2 GB/s      1100 MB/s   2.9 us + 0.19 us/pg    14.6 KB
Dual 2-core Woodcrest 3 GHz    PCI-E 8x       86 ns         3.2 GB/s      1000 MB/s   0.6 us + 0.09 us/pg    3.6 KB

Figure A.2: Cost of memory registration on different architectures
[Figure A.3: Performance of MX PIN CACHE depending on cached data ratio — registration cost of a 2MB buffer, normalized by the cost without MX pin cache, as a function of the ratio of already-cached data (0% to 100%), with and without MX pin cache.]

Appendix B

LINUX memory management

Throughout this document, we deal with the LINUX memory subsystem. In order to understand the basic concepts involved, this annex very briefly presents some notions that may be useful to the reader. Note that there are many good references that handle this topic in much more detail [9, 3].

B.1 VMA and process address space

When dealing with a process address space, it is crucial to be able to say whether, for a given virtual address, there is an object at that position or not. We must thus define the notion of memory region. LINUX implements those memory regions by means of objects named vm_area_struct, each of them describing a given region, or VMA. A VMA can either correspond to a mapped file or be anonymous. The former case occurs when a file is mapped by a process; anonymous VMAs are used when the corresponding data does not come from a file, for instance the VMA dedicated to the heap. A process address space is described by a list of those VMAs, each carrying information about the corresponding memory area: the start and the end of the region, the access permissions, and various other fields.
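For instance, a kernel-side check of whether a user virtual address falls inside a VMA can be written with the standard find_vma helper, roughly as follows on a 2.6 kernel (a simplified sketch, error handling omitted):

    #include <linux/mm.h>
    #include <linux/sched.h>

    static int addr_is_mapped(struct mm_struct *mm, unsigned long addr)
    {
            struct vm_area_struct *vma;
            int mapped;

            down_read(&mm->mmap_sem);
            vma = find_vma(mm, addr);                /* first VMA with addr < vm_end */
            mapped = (vma != NULL && vma->vm_start <= addr);
            up_read(&mm->mmap_sem);

            return mapped;
    }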


B.2 Paging and virtual memory

However, VMAs only deal with the abstraction offered to the process: a single chunk of contiguous memory for the whole process address space. Underneath, the physical organisation of memory has nothing to do with this. In order to offer the illusion that all memory is contiguous from the process point of view, LINUX uses the virtual memory features available on modern processors. These processors have a Memory Management Unit (MMU), which translates virtual addresses into physical ones. Not only does this make it possible to give every process the abstraction of a contiguous address space (possibly larger than the actual physical memory), it also makes it possible to enforce security policies.

As said previously, it is possible to offer processes more memory than is actually available. When a virtual address is translated, it may happen that no valid physical address is found; LINUX handles this as a page fault, since the virtual address does not currently correspond to any physical address. If memory runs short, LINUX may take a physical page that is in use and dedicate it to another set of virtual addresses through the swapping mechanism. This shows that the relationship between virtual and physical addresses is very weak and may change completely transparently for user applications. For that reason, we have to watch those changes carefully when giving physical addresses to the NIC, which is the motivation for all our current work.
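This is also why drivers pin the pages they hand to a NIC. On a 2.6 kernel this is typically done with get_user_pages, along the following lines (a simplified sketch, not the MX code; error handling is shortened):

    #include <linux/mm.h>
    #include <linux/sched.h>
    #include <linux/pagemap.h>

    /* Pin nr_pages user pages starting at uaddr; their struct page pointers
     * are stored in pages[], from which physical addresses can be derived. */
    static int pin_user_buffer(unsigned long uaddr, int nr_pages, struct page **pages)
    {
            int pinned;

            down_read(&current->mm->mmap_sem);
            pinned = get_user_pages(current, current->mm, uaddr & PAGE_MASK,
                                    nr_pages, 1 /* write */, 0 /* force */,
                                    pages, NULL);
            up_read(&current->mm->mmap_sem);

            return pinned;          /* number of pages actually pinned */
    }

    /* Release the extra references once the NIC is done with the buffer. */
    static void unpin_user_buffer(struct page **pages, int nr_pages)
    {
            int i;

            for (i = 0; i < nr_pages; i++)
                    page_cache_release(pages[i]);
    }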

Appendix C

Notes for implementing VMA SPY 2 in the LINUX kernel

We will now show how we modified VMA SPY to improve it so that it better fits our requirements.

Using VMA SPY as a code base

As one may expect, VMA SPY 2 was meant to be an evolution of the existing VMA SPY code. Since that work was done for a 2.6.11 kernel and the current LINUX version at the time of this writing is 2.6.22, the former implementation had to be adapted a little, even though the changes remained very limited.

Let us first describe how the former version of VMA SPY is used. It is possible to have multiple drivers using VMA SPY at the same time. For instance, on figure C.1, two drivers spy either the first or the third VMA, or both. A spy is first put on each spied VMA. Then, when a driver wants to monitor the activity on a given VMA, it attaches a type structure to that spy, all types being linked together. For instance, when the first VMA is unmapped, the spy is detected, and the unmap methods of both types are executed sequentially.

[Figure C.1: VMA Spy 1 — two drivers attach their type structures to the spies placed on VMA 1 and VMA 3; VMA 2 is not spied.]
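The structures involved can be pictured roughly as follows. This is an illustrative sketch in the spirit of figure C.1, with field names of our own, not the actual VMA SPY declarations:

    /* One callback set attached by a spying driver to a spied VMA. */
    struct vma_spy_type {
            unsigned long magic;            /* identifies the spying driver ("spyer") */
            int may_duplicate;              /* may several copies coexist on one VMA? */
            void (*unmap)(struct vma_spy_type *type,
                          unsigned long start, unsigned long end);
            struct vma_spy_type *next;      /* all types of a spy are linked together */
    };

    /* The spy itself, hanging off a spied vm_area_struct. */
    struct vma_spy {
            struct vma_spy_type *types;     /* one entry per spying driver */
    };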

Making VMA SPY 2 state-full

Now, for the sake of simplicity, let us concentrate on the partial unmapping problem in the following. With regard to figure 3.1, where the do_munmap kernel function unmaps [B, C[ out of a VMA [A, D[, one may try to simply modify the hook in do_munmap so that it only reports the actually unmapped range, namely [B, C[ and not [B, D[.

Let us then consider this behaviour: when [v_start, v_end[ is to be unmapped, a hook in do_munmap calls the corresponding munmap method, passing the correct [v_start, v_end[ address range to the driver. This seems reasonable, but it actually does not work. Indeed, when [B, C[ is unmapped, the first step performed by LINUX is to split [A, D[ at B. Looking at the LINUX split_vma function, in order to split [A, D[ into [A, B[ and [B, D[, LINUX has to unmap [B, D[ and to create a brand new VMA on [B, D[. So whatever it does afterwards, at some point the do_munmap function is called on [v_start = B, v_end = D[, and with the suggested behaviour, this would make VMA SPY 2 report that [B, D[ is unmapped. We end up in a strange situation: we know that the first unmap, performed on [B, C[, is meaningful, but we also know that the unmap of [B, D[ will immediately be followed by the creation of a VMA, and from the do_munmap perspective there is no way to distinguish these two unmappings. So we had to find a way to express the fact that some do_munmap calls are meaningful while everything that follows is not, and should be hidden from the spying drivers. The solution that came up was to give VMA SPY 2 a stateful design.
To do so, a vm_spy_active flag is added to the VMA structure: when it is set, any change is reported by VMA SPY 2; when it is not set, memory changes may occur without the spying drivers being informed. Considering the former example, when the split occurs, the corresponding hook now essentially does:

    if (vm_spy_active)
            notify_vma_spy_unmap();

By managing vm_spy_active properly, we can address the issue we encountered in section 3.3.1. For instance, we report the meaningful unmap, reset the flag, let LINUX do all its black magic silently, and set the flag back.

Unfortunately, while this does solve our first problem, it makes VMA SPY 2 somewhat more complicated than VMA SPY, which basically had to track any call that could change the mapping: we now have to understand not only the underlying mechanisms, but also the actual semantics of the way LINUX manages memory. And even though there are numerous references helping to understand the core of the LINUX memory sub-system [9, 3], it required a lot of work to catch the proper mapping modifications while hiding the underlying mechanisms.

Moreover, making VMA SPY 2 state-full makes it even more sensitive to concurrency issues: if the vm_spy_active flag is not managed correctly, we might have race conditions, possibly horrific for the spying drivers, leading for instance to unreported mapping changes. Having to understand the actual semantics of the memory sub-system also makes the code harder to maintain, as the kernel keeps changing continuously.

Spy duplication and spy migration

While making VMA SPY 2 state-full did reduce the amount of useless cache invalidation, some issues remained. Once again, let us consider figure 3.1. VMA SPY 2 now only informs its spying drivers that [B, C[ was unmapped, while the LINUX kernel still unmaps and then remaps [C, D[ into a brand new VMA. The former VMA SPY did not take that situation into account, so it left the new VMA without any spy on it. Since the two VMAs result from a partial unmap of the same VMA, one would expect them to have the very same properties; yet all of [A, D[ was watched before the unmap, while afterwards only [A, B[ is, and [C, D[ is not. To address that issue, when a VMA is split, we now copy its spies onto the newly created VMA, so that it is watched as well.

Similarly, we saw that when two VMAs were merged using VMA SPY, neither of them was eventually spied anymore. This is solved by a similar technique: the spies of the VMA to be removed are moved into the VMA to be extended. We thus added real support for both VMA splitting and VMA merging, both of which were lacking in the former VMA SPY implementation.

One may object that this increases the overhead of VMA SPY 2, since it adds some memory copying. But on the one hand, thanks to our state-full implementation, we avoid useless copies since spies are transferred if and only if this is necessary. On the other hand, not only are VMAs rarely modified, but the number of spies is also quite small, since it should be of the order of the number of spying drivers, thus usually at most one.

Avoiding spy type clones

There still remains a small issue: when a VMA is split, spies are copied onto a brand new VMA, which causes no problem. But when merging two or more VMAs, it could be that both VMAs carry a spy from the same driver, so that the same spy would end up twice on a single VMA. If such a situation kept repeating, with some VMA continuously being split and re-merged, the number of identical spies could explode.

As we want a generic scheme, we should take into account that some drivers could put spies that are specifically related to some memory area, so that when two VMAs are merged, the driver expects to still have both spies doing what they were expected to do. To address that issue without restricting VMA SPY 2 features, we added yet another flag, may_duplicate, to the spy type. When a merge is performed and the spy types are moved, those which do not have the may_duplicate flag set are copied if and only if there is not already a spy type with the same magic number (the spyer, which indicates which driver is spying). This requires going through the spy type list to make sure we can copy a spy type, but since we expect at most a couple of spying drivers per VMA, this is not a problem. In addition, this limited overhead makes it possible to avoid calling the same hooks several times, so it is worth wasting a few CPU cycles now to save them later. Note that if the may_duplicate flag is set, there is no overhead, and drivers may put several spy types on a single VMA. Eventually, this small trade-off solved our issue without hurting performance.
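The spy migration on a merge, with the duplicate check described above, can be sketched as follows. The structure is a simplified version of the one sketched earlier and the function names are ours, for illustration only:

    struct vma_spy_type {
            unsigned long magic;            /* identifies the spying driver */
            int may_duplicate;
            struct vma_spy_type *next;
    };

    /* Does the spy list already carry a type from the driver identified by magic? */
    static int spy_has_type(struct vma_spy_type *list, unsigned long magic)
    {
            for (; list; list = list->next)
                    if (list->magic == magic)
                            return 1;
            return 0;
    }

    /* Merge the types of the removed VMA (src) into the extended VMA (dst):
     * types with may_duplicate set are always moved, the others only if dst
     * does not already carry a type from the same driver. */
    static void spy_merge_types(struct vma_spy_type **dst, struct vma_spy_type **src)
    {
            while (*src) {
                    struct vma_spy_type *type = *src;

                    *src = type->next;      /* detach from the source list */
                    if (type->may_duplicate || !spy_has_type(*dst, type->magic)) {
                            type->next = *dst;
                            *dst = type;    /* move it to the destination list */
                    }
                    /* else: an identical spy type is already there; in the real
                     * code this duplicate would simply be freed */
            }
    }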

Appendix D

MX PIN CACHE implementation

[Figure D.1: The MX pin cache container data structure — a sorted, doubly linked list of intervals (fields start, end, prev, next, meta); the container keeps first and last pointers to the ends of the list and a cache pointer to the most recently accessed interval; each interval's meta field points into the array of metapages holding the cached translations.]

D.1.2

Metapage handling

Now that we are able to determine which part of memory is cached, we have to store cached data. A very simple approach would have been to statically allocate an array with all the possible metapages. As each metapage indexes 4MB on a 32bits system, this would require no more than a thousand page to represent the overall 4GB address space. This would therefore require 4MB per endpoint. Now consider a 64bits system. Each metapage would account for 2MB, and the address space would now cover 4 billions times more memory than in the 32bits case. It now becomes completely unrealistic to represent our metapage space with a simple array, and it thus has to be managed dynamically. And we chose to have the metapage handling fully dynamic even on 32bits systems. So each interval points to a local metapage pointers array, which makes possible to find the nth metapage used by the interval. This makes the indexing of metapages really simple as one just need to compute the index of the metapage, the offset in that local pointer array with regard to the interval start, and then retrieve the pointer to cached data. And while figure D.1 represents all metapages as an array, this is only a logical representation, as each entries are allocated when needed. As suggested by the figure, several intervals may refer to the same metapage. So we had to keep a reference count for each of those metapage containing data. When a metapage is referenced for the first time, it must be allocated. When a metapage is already in use, we have to increment that reference count. The trouble came when we had to merge several intervals, or to split an interval when some part of the cache must be inval-

1 else

idated. When no interval refers to the metapage anymore, it has to be freed after the corresponding physical pages are unpinned1

D.2

MX P IN C ACHE in action

Now let us consider how this cache can be used. During a first communication, there is no element in the list, so a new interval has to be created. Likewise, no other interval refers to the corresponding metapages. Those also have to be allocated. All corresponding virtual pages are pinned and their physical address are stored in the allocated pages. When a new interval is registered, the list is inspected to figure out which interval comes before, which comes after and which must be merged if any. In case any interval must be merged, the corresponding metapages are not reallocated, and only the new virtual pages are pinned, since the other were already pinned previously. When a change is detected in the mapping, and that a memory range has to be invalidated, the list is inspected to find out which interval overlaps this range. All those are either split or completely removed. The invalid pages are unpinned and removed from the cache. And all corresponding metapages are deallocated, if and only if there is no – still valid – interval referring to them. Now when a buffer must be registered, a lookup is first performed in the cache. All the yet unregistered parts of that buffer are added to the cache, and then the data are directly fetched from the cached metapages, instead of reconstructing them from scratch.

physical page would be kept in memory forever, even after process death, thus leaking memory

30

Bibliography

[1] David Addison. Linux VM hooks for advanced RDMA NICs, 2005. Linux Kernel Mailing List. http://lists.linuxcoding.com/kernel/2005-q2/msg09147.html.

[2] Raoul A. F. Bhoedjang, Tim Rühl, and Henri E. Bal. User-Level Network Interface Protocols. Computer, 31(11):53–60, 1998.

[3] Daniel P. Bovet and Marco Cesati. Understanding the Linux Kernel. O'Reilly, 3rd edition, 2005.

[4] C. Dalton, G. Watson, C. Calamvokis, A. Edwards, and J. Lumley. Afterburner: A network-independent card provides architectural support for high-performance protocols. IEEE Network, 7(4):36–43, 1993.

[5] Patrick Geoffray. A critique of RDMA, 2006. http://www.hpcwire.com/hpc/815242.html.

[6] Brice Goglin. Réseaux rapides et stockage distribué dans les grappes de calculateurs : propositions pour une interaction efficace. PhD thesis, École normale supérieure de Lyon, 46, allée d'Italie, 69364 Lyon cedex 07, France, October 2005. 194 pages.

[7] Brice Goglin. What HPC networking requires from the Linux kernel, 2006. http://www.hpcwire.com/hpc/811570.html.

[8] Brice Goglin, Olivier Glück, and Pascale Vicat-Blanc Primet. An Efficient Network API for in-Kernel Applications in Clusters. In Proceedings of the IEEE International Conference on Cluster Computing, Boston, Massachusetts, September 2005. IEEE Computer Society Press.

[9] Robert Love. Linux Kernel Development. Novell Press, 2nd edition, 2005.

[10] Gordon E. Moore. Cramming more components onto integrated circuits. Electronics, 38(8), April 1965.

[11] H. Tezuka, A. Hori, and Y. Ishikawa. Pin-down cache: A virtual memory management technique for zero-copy communication. In IPPS '98: Proceedings of the 12th International Parallel Processing Symposium, page 308, Washington, DC, USA, 1998. IEEE Computer Society.

[12] T. von Eicken, A. Basu, V. Buch, and W. Vogels. U-Net: a user-level network interface for parallel and distributed computing. In SOSP '95: Proceedings of the fifteenth ACM symposium on Operating systems principles, pages 40–53, New York, NY, USA, 1995. ACM Press.

[13] Matt Welsh, Anindya Basu, and Thorsten von Eicken. Incorporating memory management into user-level network interfaces. Technical Report TR97-1620, 1997.

[14] Wm. A. Wulf and Sally A. McKee. Hitting the memory wall: Implications of the obvious. Computer Architecture News, 23(1):20–24, 1995.

[15] P. Wyckoff and J. Wu. Memory registration caching correctness. In Cluster Computing and the Grid (CCGrid 2005), IEEE International Symposium, pages 1008–1015, Washington, DC, USA, 2005. IEEE Computer Society.
