HIGH-SPEED AND MEMORY EFFICIENT TCP STREAM SCANNING USING FPGA

Yutaka Sugawara, Mary Inaba, Kei Hiraki
Department of Computer Science, University of Tokyo
7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan
email: {sugawara, mary, hiraki}@is.s.u-tokyo.ac.jp

ABSTRACT

In this paper, we propose methods that enable high-speed and memory efficient TCP stream level string matching using FPGA. Packet loss and inconsistent retransmissions are handled without dropping packets. Received packets are processed in their arriving order to reduce the buffering memory size. Consistency of retransmission packets is checked using hash value comparison. We evaluate the proposed system using a Xilinx XC2VP100-5 FPGA. A 40Gbps network is supported by the proposed system with 140MB of memory usage under a realistic traffic pattern. In addition, the proposed system realizes 39.3Gbps packet-processing throughput for a 1017 character rule set, and 1.85Gbps throughput for a 16375 character rule set.

(Figure 1 illustration: the string "an attack" is split across packets as "an at" and "tac k"; the per-packet matching unit cannot find a match for the rule set {'the', 'attack'}.)

Fig. 1. A pattern that is not discovered by per-packet matching

(Figure 2 illustration: the stream "…anidea ofatta ckanda ttacka ndmor…" is plotted against the TCP sequence number; matching stops at a packet loss, and the succeeding packets must be processed appropriately.)

1. INTRODUCTION

In some network applications, packet payload analysis is necessary. Examples of such applications include network intrusion detection systems (NIDSs) and content-based quality of service (QoS) control systems. For instance, in an NIDS, the contents of packets are analyzed to check whether they contain specified attack patterns. For payload analysis, FPGA-based methods are advantageous because they provide both high speed and flexibility. To support recent 10Gbps-class networks, high-speed payload analysis is necessary; therefore, a hardware-based method is required. However, analysis policy updates are also necessary. For example, in an NIDS, policy updates are needed to cope with new attack methods. With an FPGA, the policy is updated by reconfiguration, while high speed is maintained. In this paper, we discuss exact string matching using FPGA. Exact string matching is a major method for analyzing packet payloads. In exact string matching, an analysis system checks whether packet payloads contain one or more

This research is partially supported by the Special Coordination Fund for Promoting Science and Technology from the Ministry of Education, Culture, Sports, Science and Technology of Japan, the CREST project of the Japan Science and Technology Corporation, and the 21st Century COE project of the Japan Society for the Promotion of Science.

0-7803-9362-7/05/$20.00 ©2005 IEEE

Fig. 2. Stopped matching due to a packet loss

of pre-defined octet strings (rules). This type of analysis is used in NIDSs like Snort [1] and Dragon [2]. For accurate matching, TCP-stream-level matching is necessary. In older systems, matching was performed in a per-packet manner. As a result, those systems cannot discover patterns that span multiple packets, as shown in Figure 1. In TCP-stream-level matching, packet loss handling is necessary. When a packet is lost, string matching is stopped before it, as shown in Figure 2. To obtain a correct match result, succeeding packets must be processed appropriately. With existing methods, packet loss handling is difficult when the network is fast. To handle packet loss by buffering succeeding packets [3] [4], a large memory is necessary. When packet loss is handled by dropping succeeding packets [5], the TCP throughput is reduced. In addition, with existing methods, the match result is not correct when there are inconsistent retransmissions. The term "inconsistent" means that the payload data differs from the first transmission data. In this paper, we propose the out-of-order matching


method. Using this method, memory usage for packet loss handling is reduced without dropping packets. Packets are matched in their arriving order to minimize buffering. In addition, we propose the packet fingerprint method. Using this method, inconsistent retransmissions are detected and blocked. Hash value comparison is used to detect inconsistent packets with a small memory usage. This paper is organized as follows. In Section 2, we describe the limitations of existing TCP stream scanning methods. We present the details of the out-of-order matching method in Section 3. The packet fingerprint method is explained in Section 4. We present an implementation of a TCP stream matching unit using the proposed methods in Section 5, and evaluate it in Section 6. Related studies are discussed in Section 7, and we conclude in Section 8.
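The difference between per-packet and stream-level matching (the Figure 1 scenario) can be illustrated with a small software sketch. This is a toy illustration, not the paper's FPGA design: the function names are invented, and the "saved octets" idea here foreshadows the two-edge buffering scheme of Section 3.

```python
# Toy illustration: why per-packet matching misses rules that span packet
# boundaries, and how keeping the last l-1 octets of already-scanned data
# (l = length of the longest rule) recovers such crossing patterns.
RULES = ["the", "attack"]
L = max(len(r) for r in RULES)  # l: length of the longest rule

def per_packet_match(packets):
    """Scan each packet in isolation."""
    return {r for p in packets for r in RULES if r in p}

def stream_match(packets):
    """Scan packets in order, prepending the saved l-1 boundary octets."""
    hits, tail = set(), ""
    for p in packets:
        window = tail + p            # concatenate saved edge octets
        hits |= {r for r in RULES if r in window}
        tail = window[-(L - 1):]     # save the last l-1 octets
    return hits

packets = ["an at", "tack"]          # "an attack" split across two packets
print(per_packet_match(packets))     # set(): 'attack' is not found
print(stream_match(packets))         # {'attack'}
```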

(Figure 3 illustration: a first transmission "at ta ck", a retransmission "is sa fe", and another retransmission "no rm al" cover the same TCP sequence number range. Possibilities for the actually accepted data include attack, attafe, attaal, atsack, atsafe, …; the accepted data cannot be identified.)

Fig. 3. An example of inconsistent retransmission
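The ambiguity in Figure 3 can be verified with a few lines of code: each of the three 2-octet segments may be accepted from any of the three transmissions, giving 3^3 = 27 possible accepted sequences. This is a sketch; the segment layout follows the figure.

```python
from itertools import product

# Three overlapping transmissions of the same 6-octet range (Fig. 3).
transmissions = ["attack", "issafe", "normal"]

# Each 2-octet segment may be accepted from any of the three transmissions.
segments = [[t[i:i + 2] for t in transmissions] for i in (0, 2, 4)]
accepted = {"".join(choice) for choice in product(*segments)}

print(len(accepted))         # 27 possible accepted data sequences
print("atsafe" in accepted)  # True
```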

(Figure 4 illustration: for each fragment on the TCP sequence number axis, the ℓ−1 octets at each edge are saved; holes between fragments are marked.)

Fig. 4. Saving boundary octets of each fragment

2. LIMITATIONS OF EXISTING METHODS

2.1. Handling Packet Loss

In existing TCP stream matching methods, when a packet is lost, the succeeding packets are buffered or dropped. In the buffering method [3] [4], out-of-order packets are stored in a buffer to sort them into sequence number order. The sorted data is fed to a string matching unit. When the next octet is not available in the buffer, the matching stops until it has been received. In the buffering method, all the in-flight packets are saved in the worst case, increasing the memory usage. Specifically, the memory usage is [round trip time (RTT)] × [network bandwidth]. In a realistic example with an RTT of 200ms and a bandwidth of 40Gbps, the memory usage is 1GB. Since network speed grows faster than memory capacity, this problem will become more serious in the future.

In the packet dropping method [5], the system drops all the succeeding packets of a lost packet. Then, it waits for the lost packet and the dropped succeeding packets to be retransmitted. As a result, the matching system eventually receives the lost packet and the succeeding packets in sequence number order. In this method, since many packets are dropped, the communication throughput is reduced by the TCP flow control mechanism. In the popular TCP implementations, TCP-Reno/NewReno, it takes RTT × [number of packet drops] time to recover from a packet loss [6]. In addition, the congestion window (CWND) is reset to the minimum value if the retransmission timeout (RTO) is reached.

2.2. Handling Inconsistent Retransmission

It is possible to retransmit a TCP packet whose payload data is different from the first transmission data. Though such inconsistent retransmissions are prohibited by the TCP standard [7], there is no physical mechanism to block such inconsistent packets. Therefore, network users are able to send inconsistent packets. With existing methods, when inconsistent packets are sent, the match result is not correct. For example, in the situation set out in Figure 3, there are a total of 27 possibilities for the data sequence finally accepted by the receiver. In such a situation, the actual data accepted by the receiver is not decidable [8]. Using this weak point, senders can cause the matching system to return a wrong answer. Therefore, existing methods cannot be used for applications like NIDS that rely on the accuracy of the matching.

3. OUT-OF-ORDER MATCHING METHOD

In the out-of-order matching method, the system scans packets in their arriving order to minimize buffering. That is, each packet is scanned immediately when it is received. As a result, the system discovers all patterns completely contained in the packet. At this point, all the work remaining for this packet is to find the patterns that cross the packet edges. Thus, the system does not have to save data unnecessary for finding such crossing patterns. We propose two schemes to find patterns crossing an edge of a packet: the two-edge buffering scheme and the one-edge buffering scheme.

3.1. Two-edge Buffering Scheme

In the two-edge buffering scheme, the system holds octets at the edges of each fragment to discover patterns crossing the fragment edges. Specifically, when a fragment is large enough, the ℓ−1 octets at both edges are saved, as shown in Figure 4, where ℓ is the length of the longest rule. This is because all the crossing patterns are included in these octets. Since ℓ for practical rule sets is much smaller than the


(Figure 5/6 illustration: representative position relationships between existing fragments, holes, and a newly arrived packet on the TCP sequence number axis, with the scanned ranges and saved octets marked; in Figure 6, the final matching states S1 and S2 of each fragment are saved instead of the ending edge octets.)

Fig. 5. Representative position relationships between fragments and a new packet

Fig. 6. Saving the final matching state of each fragment

fragment sizes, memory usage is reduced. For example, in the Snort NIDS [1], the maximum rule length is less than 128 octets. On the other hand, the Internet traffic bandwidth is mostly occupied by packets of more than 512 octets [9]. Since a fragment consists of multiple packets in typical cases, most fragments are much longer than ℓ octets. If a small number of long rules exist in a rule set, we can remove them by splitting them into shorter sub-patterns and performing a logical-AND operation on the match results.

When a neighboring packet of a fragment is received, a string match is performed on the concatenation of the fragment edge octets and the received payload octets. As a result, all the patterns are discovered, including the crossing patterns. Figure 5 shows the ranges of the string match and the saved octets for representative relationships between a received packet and existing fragments.

3.2. One-edge Buffering Scheme

When the two-edge buffering scheme is used, there is a redundancy: the last ℓ−1 octets are saved and matched twice for most packets, because most TCP packets arrive in sequence number order. In the one-edge buffering scheme, this redundancy is eliminated by saving the internal state of the string matching unit instead of the ending edge octets of each fragment. That is, when a packet is matched, the state of the matching unit is saved at the ending edge. When a succeeding packet is received, the saved state is restored to the matching unit before the new packet is scanned, as shown in Figure 6. Using this technique, a correct matching result is guaranteed without saving the edge octets. For the one-edge buffering scheme, minimizing the matching state size is important.

4. PACKET FINGERPRINT METHOD

To get a correct matching result, it is necessary to identify the actual data accepted by the receiver even when inconsistent retransmission packets exist. The matching must be performed on the octets actually accepted by the receiver. In the packet fingerprint method, ambiguity of received

data is removed by dropping all inconsistent packets. Since inconsistent retransmissions are prohibited by the TCP standard, the packet drops do not affect normal TCP streams. The consistency of retransmissions is checked by means of hash value comparison. A hash value for each packet, called a packet fingerprint, is calculated using a hash function. When a retransmission packet is received, its hash value is compared with that of the first transmission packet. If they are not equal, the retransmission is inconsistent. If they are equal, the retransmission is judged to be consistent. In this case, an inconsistent retransmission may be judged to be consistent because of a hash value collision. However, the collision probability decreases exponentially as the number of hash bits is increased: specifically, the probability is 2^−n, where n is the number of hash bits. When a retransmission packet boundary does not match the first packet boundary, the packet is buffered until another packet fills the gap. The buffering does not continue indefinitely, because a normal TCP does not retransmit just a part of a first packet. To prevent memory exhaustion by attacks, old packets are dropped using a timer. By using hash values for consistency checks, the necessary memory size is reduced compared with strict checking, in which all payload data of first transmission packets is saved.

5. IMPLEMENTATION

Figure 7 shows our implementation of a TCP stream matching unit using the out-of-order matching and packet fingerprint methods. To hold the information of many streams and in-flight packets, DDR-SDRAM is used. Each packet is input together with a stream ID assigned by an external packet classifier. Using this stream ID, a stream information block (SIB) is loaded from and stored to the DDR-SDRAM. An SIB is a memory block that holds TCP stream information. The suffix based traversing (SBT) string matching method [10] is used in the implementation to enable both high matching throughput and a small state size.
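As a software illustration of the fingerprint check described above: a hash is recorded per first-transmission packet, keyed by its sequence range, and a retransmission is dropped if its hash differs. This is a sketch only; the class name is invented, and CRC-32 stands in for the system's 64-bit fingerprint.

```python
import zlib

class FingerprintChecker:
    """Record a hash per first-transmission packet, keyed by its sequence
    range; judge a retransmission consistent only if the hashes match."""

    def __init__(self):
        self.fingerprints = {}  # (seq, length) -> packet fingerprint

    def accept(self, seq, payload):
        key = (seq, len(payload))
        fp = zlib.crc32(payload)  # stand-in for the 64-bit fingerprint
        if key not in self.fingerprints:
            self.fingerprints[key] = fp  # first transmission: record it
            return True
        # Retransmission: drop it if its fingerprint differs.
        return self.fingerprints[key] == fp

checker = FingerprintChecker()
print(checker.accept(1000, b"attack"))  # True: first transmission
print(checker.accept(1000, b"attack"))  # True: consistent retransmission
print(checker.accept(1000, b"issafe"))  # False: inconsistent, dropped
```

Only the small per-packet hash is stored, rather than the full payload, which is what reduces the memory usage compared with strict checking.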
The SBT method uses a state machine in a similar way to the Aho-Corasick algorithm. Multiple input characters are processed in parallel by decoding the input pattern based on its suffix pattern. As a result, high throughput is realized. In addition, the state size of an SBT matching unit is limited to


(Figure 7 block diagram: TCP packets pass through word boundary alignment into a TCP packet FIFO; an SBT matching unit and a packet fingerprint calculation unit process the payload; a control unit uses the stream ID for SIB access and in-flight data access through a DDR-SDRAM interface.)

Fig. 7. A matching unit using the proposed methods

(Figure 8 illustration: the 30-bit sequence number is split into 12-, 8-, and 10-bit fields; the one-level array adds an offset to the SIB array base to index a 256-element data array, while the two-level array first indexes a 4K-element pointer array and then a 256-element data array.)

Table 1. Rule sets used in the evaluation
  rule set | number of patterns | number of characters
  rule0    | 21  | 507
  rule1    | 58  | 1017
  rule2    | 113 | 2047
  rule3    | 237 | 4090
  rule4    | 512 | 8192
  rule5    | 997 | 16375

Table 2. Traffic patterns used in the evaluation
  traffic pattern | original trace name in [9] | avg. packet size, all (octets) | avg. packet size, fast path (octets)
  pattern 0 | rly-21.0-021121 | 735 | 1125
  pattern 1 | sj-21.0-020419  | 630 | 1044
  pattern 2 | sj-25.0-021009  | 764 | 1148

Fig. 8. Searching in-flight data information using arrays

O(log m), where m is the number of characters in the rule. A packet is forwarded to an external software slow path when: (1) it is a retransmission packet whose boundaries do not match the first packet boundaries, or (2) the payload is shorter than 128 octets. These conditions are introduced to simplify the fast-path hardware. As a result, 97% of the payload octets are processed in the fast path, and 3% in the slow path. This value is calculated using a TCP behavior simulation and an Internet traffic measurement report [11]. To limit the search delay, the matching system looks up in-flight data information using a one-level or a two-level array structure, depending on the CWND size, as shown in Figure 8. Each data array element holds multiple information blocks of fragments or packets. To hide the array access latency for the next packet in most cases, the information of the last received packet is stored in the SIB.

6. EVALUATION

We evaluate the memory usage and throughput of the proposed system, using the implementation described in Section 5. A Xilinx XC2VP100-5 FPGA is used for the evaluation, with Xilinx ISE 6.3i for synthesis and timing evaluation. The packet fingerprint bit width is 64. We used the rule sets shown in Table 1, selected from the rule suite of the Snort NIDS [1]. The rule size is limited by the FPGA capacity. To evaluate the proposed methods under the severest conditions, the longest Snort rule is included in all the rule sets. The longest rule length is 122 octets. Thus, 122

octets at fragment edges are saved. The number of switched state bits for the one-edge buffering scheme is at most 17 bits in all cases. For each rule set, the input width is varied. We used traffic patterns generated by a network simulator for the evaluation. Table 2 shows the details of the traffic patterns. The packet size distribution is based on measurement results of WAN networks [9]. We use pattern 0 as a baseline, and patterns 1 and 2 when we investigate the effect of average packet size. Small packets (less than 128 octets) are not processed in the fast path. Therefore, the average size of packets processed in the fast path is greater than the average size of all the packets.

Table 3 shows the usage of 18Kbit Block RAMs and slices on the FPGA for each configuration. In the table, "Full" indicates that the design does not fit in the FPGA. The two-edge buffering scheme requires more slices than the one-edge buffering scheme because additional logic is necessary to manipulate the ending edge octets of each packet.

Table 3. FPGA resource usage for each configuration (number of Block RAMs / number of slices)

One-edge buffering method:
  input width | rule0     | rule1     | rule2     | rule3     | rule4     | rule5
  4 octets    | 12/2543   | 21/2734   | 32/4015   | 62/2491   | 139/2682  | 329/3067
  8 octets    | 22/3776   | 37/4122   | 55/4667   | 102/3730  | 239/4004  | Full
  16 octets   | 41/6603   | 66/7293   | 102/8590  | 188/6638  | 433/7170  | Full
  32 octets   | 78/12449  | 125/13815 | 194/16544 | 354/12714 | Full      | Full
  64 octets   | 150/24387 | 243/27256 | Full      | Full      | Full      | Full

Two-edge buffering method:
  input width | rule0     | rule1     | rule2     | rule3     | rule4     | rule5
  4 octets    | 12/3027   | 21/3125   | 32/3436   | 62/2920   | 139/3084  | 329/3560
  8 octets    | 22/4356   | 37/4638   | 55/5247   | 102/4264  | 239/4452  | Full
  16 octets   | 41/7302   | 66/7969   | 102/9213  | 188/7297  | 433/7813  | Full
  32 octets   | 78/13528  | 125/14972 | 194/17644 | 354/13849 | Full      | Full
  64 octets   | 150/26453 | 243/29059 | Full      | Full      | Full      | Full

6.1. Memory Usage

Figure 9 shows the memory usage to process out-of-order packets. Traffic pattern 0 is used for the evaluation. For comparison, Figure 9 also shows the memory usage when all out-of-order packets are buffered. The proposed methods use less memory than full buffering: the memory usage is reduced to less than 8% of the buffering method.

(Figure 9: memory usage (MB, logarithmic scale from 0.01 to 1000) versus total size of out-of-order packets (1–1024 MB), comparing buffering all packets, two-edge buffering, and one-edge buffering.)

Fig. 9. Memory usage for out-of-order packet handling

Tables 4 and 5 show the total memory usage. The link speed is assumed to be 10Gbps or 40Gbps, and the RTT is assumed to be 200ms. Therefore, the in-flight data size is 250MB when the link speed is 10Gbps, and 1GB when the link speed is 40Gbps. The number of streams is assumed to be 100K or 1M, based on [9].

Table 4. Total memory usage (traffic pattern 0)
  in-flight data size | number of streams | one-edge buffering (MB) | two-edge buffering (MB)
  250MB | 100K | 21.0  | 22.2
  250MB | 1M   | 82.8  | 84.0
  1GB   | 100K | 63.4  | 68.1
  1GB   | 1M   | 125.2 | 129.9

Table 5. Total memory usage for each traffic pattern (in-flight data 1GB, 1M streams)
  traffic pattern | one-edge buffering (MB) | two-edge buffering (MB)
  pattern0 | 125.2 | 129.9
  pattern1 | 134.6 | 140.0
  pattern2 | 123.1 | 127.6

As shown in the tables, high-speed networks are supported by the proposed methods with a practical memory usage. For traffic pattern 0, the necessary memory size is 129.9MB when the link speed is 40Gbps. The memory usage is larger for pattern 1, because more packets are saved by the system when the average packet size is small. However, the usage remains within a practical range.
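The in-flight data sizes assumed above follow directly from RTT × bandwidth (Section 2.1); a quick arithmetic check:

```python
# Worst-case in-flight data for a full-buffering method (Sec. 2.1):
# RTT x bandwidth, converted from bits to bytes (integer arithmetic).
def in_flight_bytes(rtt_ms, bandwidth_bps):
    return rtt_ms * bandwidth_bps // (1000 * 8)

print(in_flight_bytes(200, 10_000_000_000))  # 250000000  (250 MB at 10 Gbps)
print(in_flight_bytes(200, 40_000_000_000))  # 1000000000 (1 GB at 40 Gbps)
```

Against these full-buffering figures, the roughly 125–140 MB totals in Tables 4 and 5 show the savings of the proposed schemes at 40Gbps.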

6.2. Packet Processing Throughput

The packet-processing throughput of the proposed system is evaluated. Throughput is calculated from the maximum operating clock speed and the number of clock cycles necessary to process a set of packets. The maximum operating frequency is evaluated using the post place-and-route static timing analysis tool of Xilinx ISE 6.3i. The number of elapsed clock cycles is measured using a cycle-accurate simulator combined with a TCP network simulator.

Figure 10 shows the packet-processing throughput for each rule set. Traffic pattern 0 is used for the evaluation. When the one-edge buffering scheme is used, the maximum throughput is 1.85–39.3Gbps, depending on the rule size. To the best of our knowledge, the proposed system is the first TCP stream scanning system to realize over 10Gbps throughput. When the two-edge buffering scheme is used, the throughput decreases when the input width is large, because of the additional memory accesses for edge octets. In this respect, the one-edge buffering scheme is better than the two-edge buffering scheme for large input widths.

Figure 11 shows the packet-processing throughput for each traffic pattern. The figure shows that the throughput increases as the average packet size increases. This effect is large when the input width is large, because the performance is then limited by the overhead time necessary to process each packet.

7. RELATED WORKS

Schuehler et al. proposed a TCP stream level pattern matching method [5], with a system consisting of a TCP stream reassembler called TCP Splitter [12] and a pattern matching unit using a deterministic finite automaton (DFA). Since the


TCP Splitter drops all the out-of-order packets, the communication throughput is reduced because of the TCP congestion control. Li et al. proposed a TCP stream reassembler to combine with a string matching unit [4]. Necker et al. proposed a similar system [3]. Since these systems save all the out-of-order packets, a large memory is necessary to support fast networks. On the other hand, our method requires neither full buffering nor dropping of out-of-order packets.

(Figure 10: two panels, one-edge and two-edge buffering; throughput (Gbps, 0–50) versus input width (4–64 octets) for rule0 (507 char.) through rule5 (16375 char.).)

Fig. 10. Packet processing throughput (pattern 0)

(Figure 11: two panels, one-edge and two-edge buffering; throughput (Gbps, 0–50) versus input width (16–64 octets) for pattern0, pattern1, and pattern2.)

Fig. 11. Packet processing throughput for each traffic pattern (rule1)

8. CONCLUSION

We have proposed a high-speed and memory efficient TCP stream level string matching system that handles packet loss and inconsistent retransmissions without dropping packets. To handle packet loss, we proposed the out-of-order matching method. Using this method, neither packet buffering nor packet dropping is necessary. To handle inconsistent retransmissions, we proposed the packet fingerprint method. Using this method, the matching system is able to discover and discard inconsistent packets with a small memory usage. The evaluation shows that the proposed system achieves both low memory usage and high throughput: memory usage is reduced to less than 8% of a full-buffering method, and the evaluated throughput is 1.85–39.3Gbps, depending on the rule size. Our future work includes evaluating the proposed system using real network traffic, and applying a similar idea to regular expression matching.

9. REFERENCES

[1] M. Roesch, "Snort - Lightweight Intrusion Detection for Networks," in Proc. of LISA '99: 13th Systems Administration Conference, 1999.

[2] "Enterasys Intrusion Defense," February 2003, http://www.enterasys.com/products/ids/.

[3] M. Necker, D. Contis, and D. Schimmel, "TCP-stream reassembly and state tracking in hardware," in Proc. of 10th Annual IEEE Symp. on Field-Programmable Custom Computing Machines (FCCM '02), September 2002, pp. 286–287.

[4] S. Li, J. Tørresen, and O. Søråsen, "Exploiting Stateful Inspection of Network Security in Reconfigurable Hardware," in Proc. of 13th Intl. Conf. on Field Programmable Logic and Applications (FPL '03), September 2003, pp. 1153–1157.

[5] D. V. Schuehler, J. Moscola, and J. Lockwood, "Architecture for a Hardware Based, TCP/IP Content Scanning System," in Proc. of 11th IEEE Symp. on High Performance Interconnects (HotI '03), August 2003, pp. 89–94.

[6] K. Fall and S. Floyd, "Simulation-based comparisons of Tahoe, Reno and SACK TCP," Computer Communication Review, vol. 26, no. 3, pp. 5–21, July 1996.

[7] J. Postel, "RFC 793: Transmission Control Protocol," September 1981.

[8] M. Handley and V. Paxson, "Network Intrusion Detection: Evasion, Traffic Normalization, and End-to-End Protocol Semantics," in Proc. of 10th USENIX Security Symposium, August 2001.

[9] "IP Monitoring Project," http://ipmon.sprint.com/ipmon.php.

[10] Y. Sugawara, M. Inaba, and K. Hiraki, "Over 10Gbps string matching mechanism for multi-stream packet scanning systems," in Proc. of 14th Intl. Conf. on Field Programmable Logic and Applications (FPL '04), August 2004.

[11] K. Thompson, G. J. Miller, and R. Wilder, "Wide-area Internet traffic patterns and characteristics," IEEE Network Magazine, vol. 11, no. 6, pp. 10–23, November/December 1997.

[12] D. V. Schuehler and J. Lockwood, "TCP-Splitter: A TCP Flow Monitor in Reconfigurable Hardware," in Proc. of Symp. on High Performance Interconnects (HotI '02), August 2002, pp. 127–131.