Resource Modification On Multicore Server With Kernel Bypass

Technology develops rapidly, marked by continuous innovation in both hardware and software. Multicore servers with a growing number of cores require efficient software. The kernel and hardware used to handle various operational needs have several limitations. These limitations stem from the complexity of acting as a server, such as a single socket descriptor, a single IRQ, and the lack of polling, so several modifications are required. Kernel bypass is one method to overcome these kernel deficiencies. The modifications on this server combine an increase in throughput with a decrease in latency. A driver-level modification that hashes the RX signal, together with multiple-receive modifications using multiple IP receivers, multiple receiver threads, and multiple port listeners, is used to increase throughput. Modifications that apply the polling principle, either at the kernel level or at the program level, are used to decrease latency. This combination of modifications makes the server more reliable, with an average throughput increase of 250.44% and an average latency decrease of 65.83%.

Keywords— hash rx, multiple ip, multiple port, multiple thread, polling, kernel bypass


INTRODUCTION
Previous work [1] states several important issues for improving performance on multicore systems. Performance can be improved by avoiding process migration on multicore systems and by designing dedicated techniques, tools, or scheduling of tasks/processes/threads together with affinity (a mapping between software and hardware) so that a process runs on a specific core. The literature survey shows that, in a default operating system, processes are transferred or migrated across different threads, and that appropriate development steps are required to increase hardware processing speed.
Research by Bo (2016) [2], Diener et al. (2016) [3], and Paul, Bhattacharjee and Rajesh (2014) [4] states that performance can be improved with NUMA (Non-Uniform Memory Access). NUMA can improve performance by reducing thread migration through memory partitioning. Contrary to these studies, Majo & Gross (2013) [5] show that this view of the NUMA concept is mistaken: NUMA actually adds problems in internal memory and reduces process performance. Tang et al. (2013) [6] conducted NUMA research on the Gmail backend server and the Google Web-search frontend with the same result, namely a decrease in performance. The decrease is indicated by a drop of 15% on the AMD Barcelona platform for the Gmail backend and of 20% on Intel Westmere for the Web-search frontend.
The system built here is a modification to avoid migration on multicore systems, together with tests that provide data on the difference in processing speed between shared cores and dedicated cores commonly found in a multicore processor. The modifications, motivated by kernel deficiencies on the server, produce a better system as indicated by increased throughput and decreased latency. They include a driver-level modification that changes the signal handling by turning the hashing algorithm into multiple queues, to overcome the single-IRQ limitation of the kernel. Modifications are also made at the kernel level for the port listener so that there is no lock in the default kernel, following the analysis of (Rivera et al., 2014) [7], by binding sockets so that each port can be used simultaneously by all CPUs without the locking and waiting that increase processor load. Further modifications reduce latency with polling techniques on data packets, which are expected to avoid context switching while addressing kernel shortcomings in flow control.

METHODS
The study began by examining the difference between shared cores and dedicated cores, using a workload-looping program that produces a graph of time differences in milliseconds (ms). The measured speed difference later determines the server recommendation among multicore designs that use shared cores, known as Hyper-Threading (HT) on Intel. The next step determines the throughput of the default server system using a simple sender and receiver program. This default figure is used as the baseline for judging the success of the modifications, collectively called kernel bypass, in the later steps.

The modifications consist of hashing the receive signal, which divides the signal into several queues instead of the single default queue on the Ethernet card. A further modification uses multiple IP (Internet Protocol) receivers to split the load and the interrupts into several parts so that throughput increases, and applies affinity to the threads, which is why it is called a multi-thread receiver. This modification also overcomes the limitation of interrupt handling, since it is almost impossible to accept data packets faster than one CPU can process them when a single interrupt serves the Ethernet device. The next modification is performed by binding the port.

To increase server reliability, an additional measurement parameter is used, namely RTT (Round-Trip Time), the time the server needs to process a packet and return it. RTT is obtained by modifying the existing sender and receiver programs to report the time in milliseconds (ms). The modifications in this stage only use the polling principle. Two types of polling are applied: in the kernel, by enabling the SO_BUSY_POLL option, and inside the body of the program (polling in code).
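As an illustration of the shared-core versus dedicated-core probe described at the start of this section, the following minimal C sketch pins a synthetic workload loop to a chosen CPU and reports its running time in milliseconds; running two instances concurrently, once on two Hyper-Threading siblings and once on two separate physical cores, exposes the timing difference. The CPU numbers, iteration count, and program structure are illustrative assumptions, not the exact program used in the study.

/* Hypothetical sketch of the workload-looping probe: pin the process to
 * one logical CPU, run a fixed busy loop, and time it.  Launch two
 * instances at once on HT siblings (shared core) and then on separate
 * physical cores (dedicated) and compare the reported times. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double run_workload(int cpu, long iterations)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    sched_setaffinity(0, sizeof(set), &set);   /* pin to the chosen CPU */

    struct timespec t0, t1;
    volatile long sink = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iterations; i++)      /* synthetic workload loop */
        sink += i;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (t1.tv_sec - t0.tv_sec) * 1e3 +
           (t1.tv_nsec - t0.tv_nsec) / 1e6;    /* elapsed time in ms */
}

int main(int argc, char **argv)
{
    int cpu = (argc > 1) ? atoi(argv[1]) : 0;  /* CPU number is assumed */
    printf("cpu %d: %.2f ms\n", cpu, run_workload(cpu, 200000000L));
    return 0;
}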

Modifications to Increase Throughput
The resource-utilization modification on the multicore server with kernel bypass takes over the main networking functions of the kernel by applying a hashing technique to the receive signal, which divides the queue and directs each queue to a specific CPU as shown in Figure 1. Ethernet card compatibility, however, is the main problem with this hashing method.

Figure 1. Hashing signal process
This modification is carried out with the rx-flow-hash command followed by the flow type and the tuples to hash on, i.e., the udp4 flow type and the sdfn option for the IP address and port tuples, as shown in Figure 2. In this work the sdfn option was not supported; as described above, Ethernet card compatibility is the main obstacle to going further. The next modification uses multiple IPs (Internet Protocol), aimed at dividing the load across CPUs so as to overcome the limitation of a single CPU processing all IRQs (Interrupt Requests). This modification is common and is not new; in this study it was undertaken by adding a second Ethernet card with a different IP, and the sender was directed at the two different IPs as shown in Figure 3. The concept of a multi-threading sender was then adopted for the next modification, a multi-threading receiver: the same concept applied to a different object, processing more throughput when coupled with the multiple-IP modification. This modification is performed with taskset -c 1,2. On its own it can still decrease throughput because the cores contend for a single socket descriptor, so a socket-descriptor modification is required to avoid locking across multiple cores, described in the next modification.
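To make the multiple-IP and multi-thread receiver modifications concrete, the following sketch starts one receive thread per local IP address and pins each thread to its own CPU from inside the program (the programmatic counterpart of taskset -c 1,2). The IP addresses, port, CPU numbers, and packet counter are assumptions for illustration, not the exact receiver used in the study.

/* Hypothetical sketch: one UDP receive thread per local IP, each thread
 * pinned to its own CPU, so load and interrupts are split across cores. */
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <sys/socket.h>

#define PORT 5000                               /* assumed receiver port */

struct rx_conf { const char *ip; int cpu; };

static void *rx_thread(void *arg)
{
    struct rx_conf *conf = arg;

    /* Pin this receive thread to its dedicated CPU. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(conf->cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    /* Bind a UDP socket to one of the server's IP addresses. */
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port   = htons(PORT) };
    inet_pton(AF_INET, conf->ip, &addr.sin_addr);
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    char buf[2048];
    long packets = 0;
    for (;;) {                                  /* count received datagrams */
        if (recv(fd, buf, sizeof(buf), 0) > 0 && ++packets % 1000000 == 0)
            printf("%s on cpu %d: %ld packets\n", conf->ip, conf->cpu, packets);
    }
    return NULL;
}

int main(void)
{
    /* Two Ethernet cards / two IPs, two CPUs: values assumed for illustration. */
    struct rx_conf conf[2] = { { "192.168.1.10", 1 }, { "192.168.2.10", 2 } };
    pthread_t tid[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&tid[i], NULL, rx_thread, &conf[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(tid[i], NULL);
    return 0;
}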

Figure 4. Listener port modification process
The Resource Modification on Multicore Server with Kernel Bypass program overcomes the struggle for the lock shown in Figure 4 by using multiple listeners, so that each thread uses a different listener; as a result, even though the same port is used, there is no bottleneck. This multi-listener modification is applied before the accept() function. Accepting connections generally does not fork or bind a new socket, so all cores sit in waiting mode. In this study, affinity was also applied to the NIC IRQs as a companion step to improve process performance, as carried out by (Tsai et al., 2012) [8].
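The paper does not name the socket option behind the multi-listener modification; on Linux, SO_REUSEPORT (available since kernel 3.9) lets every worker thread bind and listen on its own socket for the same port, so the kernel distributes incoming connections without a shared accept lock. The sketch below assumes that mechanism; the port number, backlog, and thread count are illustrative.

/* Minimal sketch of the per-thread listener idea, assuming SO_REUSEPORT:
 * every worker creates, binds, and listens on its OWN socket for the same
 * port before calling accept(), so no thread waits on another's listener. */
#define _GNU_SOURCE
#include <netinet/in.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

#define PORT    5000                            /* assumed service port   */
#define WORKERS 4                               /* assumed thread count   */

static void *worker(void *arg)
{
    long id = (long)arg;
    int one = 1;

    /* Each thread gets its own listening socket on the same port. */
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

    struct sockaddr_in addr = { .sin_family      = AF_INET,
                                .sin_addr.s_addr = htonl(INADDR_ANY),
                                .sin_port        = htons(PORT) };
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(fd, 128);

    for (;;) {
        int client = accept(fd, NULL, NULL);    /* no shared accept lock */
        if (client < 0)
            continue;
        char buf[1024];
        ssize_t n = read(client, buf, sizeof(buf));
        if (n > 0)
            write(client, buf, n);              /* echo the request back */
        close(client);
        printf("worker %ld handled a connection\n", id);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[WORKERS];
    for (long i = 0; i < WORKERS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < WORKERS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}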

Modifications to Reduce Latency
This paper uses a modified client-server program to measure latency: the client sends packets and measures the time until the response comes back, while the server waits for packets and echoes them back to the source. In the system design to reduce latency, modifications split the RX interrupt queues and direct them to a list of CPUs, which reduces excessive interrupt handling on one core. Interrupts are placed sequentially on the CPUs, RX queue #0 to CPU #0, RX queue #1 to CPU #1, and so on, as shown in Figure 6. This is also an alternative control to the irq_balance solution. The modifications also use a polling technique on network traffic to avoid context switching. Polling settings avoid the increase in processing time that occurs when a process is handled by a different CPU. Polling is performed in two stages; the first uses the kernel polling that has been integrated since Linux kernel 3.11, namely SO_BUSY_POLL.
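The kernel-level polling stage can be illustrated with the SO_BUSY_POLL socket option introduced in Linux 3.11, which asks the kernel to busy-poll the device queue for a bounded number of microseconds instead of sleeping until the next interrupt. The helper name and the 50-microsecond budget below are assumed values; setting the option may require CAP_NET_ADMIN.

/* Sketch of enabling kernel busy polling on a socket via SO_BUSY_POLL. */
#include <stdio.h>
#include <sys/socket.h>

int enable_busy_poll(int fd)
{
    int busy_poll_usecs = 50;                    /* assumed polling budget */
    if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL,
                   &busy_poll_usecs, sizeof(busy_poll_usecs)) < 0) {
        perror("setsockopt(SO_BUSY_POLL)");      /* may need CAP_NET_ADMIN */
        return -1;
    }
    return 0;
}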
The next polling stage is implemented in the body of the receiver program so that polling becomes more effective; pseudocode is shown in Figure 7. All of the modifications are effective, as indicated by an average latency reduction of 65.83% relative to the default. Table 1 shows the overall results of the modifications made to increase throughput. These modifications outperform the NUMA method that has been used to increase throughput. The use of multiple receiver threads runs into the obstacle described by (Rivera et al., 2014) [7], which was addressed by adopting the binding-port modification that provides significant gains in the process. Averaged over all the modifications listed in Table 1, the result is an increase of 250.44%. Table 2 shows the overall results of the modifications made to reduce latency, performed after pinning and compared against the kernel SO_BUSY_POLL option. The polling performed in the body of the server program avoids blocking.
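A minimal sketch of the polling-in-code idea behind the Figure 7 pseudocode is given below: instead of blocking in recv() and paying for a wake-up, the receiver spins on a non-blocking read so the packet is picked up without a context switch. The function name and error handling are assumptions for illustration.

/* Sketch of in-program polling: spin on a non-blocking recv() so the
 * receiving core never sleeps and no context switch is needed. */
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

ssize_t poll_recv(int fd, char *buf, size_t len)
{
    for (;;) {
        ssize_t n = recv(fd, buf, len, MSG_DONTWAIT);  /* never blocks */
        if (n >= 0)
            return n;                    /* a packet (or 0 bytes) arrived */
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;                   /* real error */
        /* otherwise spin: keep the core busy instead of sleeping */
    }
}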

CONCLUSIONS
Systems developed using kernel bypass are able to increase server capability, as shown by increased throughput and decreased latency compared with the default kernel. In the experiments, the average throughput increase was 250.44% and the average latency decrease was 65.83%.