When writing services that accept TCP connections, we tend to think of our work as starting from the point where our service accepts a new client connection and finishing when we complete the request and close the socket. For services at scale, operations can happen at such a high rate that some of the default resource limits of the Linux kernel can break this abstraction and start causing impact to incoming connections outside of that connection lifecycle. This post focuses on some standard resource limitations that exist before the client socket is handed to the application - all of which came up during the course of investigating errors on production systems as part of my role at GitHub (in some cases, multiple times across different applications).
In its most basic form (ignoring non-blocking variants), listening for TCP connections requires a call to `listen()` to actually start allowing incoming connections, followed by repeated calls to `accept()` to take the next pending connection and return a file descriptor for that particular client. In C this pattern looks something like:
```c
int server_fd = socket(AF_INET, SOCK_STREAM, 0);
bind(server_fd, /* ... */);

// start listening for connections
listen(server_fd, 512);

while (running) {
    // block until a connection arrives and then accept it
    int client_fd = accept(server_fd, NULL, NULL);
    // ... handle client_fd ...
}

close(server_fd);
```
This is often hidden behind further layers of abstraction: we tend to hide away all the implementation details of accepting connections and view it as a stream of new connections that we pick up and then process in parallel. However, when building a system that runs at a certain scale, this abstraction tends to break down, because resource limitations are introduced in the period where connections are being established but have not yet been returned by `accept()`. Before that point, those client connections are considered part of the server/listen socket, not independent resources exposed to the application.
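The same pattern exists one layer up in most languages; for example, Python's socket module exposes `listen()` and `accept()` directly, and its backlog argument is the same number discussed throughout this post. A rough Python equivalent of the C loop above (illustrative only, not one of the lab scripts):

```python
import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("0.0.0.0", 8080))

# start listening for connections; 512 is the backlog, just like in the C example
server.listen(512)

while True:
    # block until a connection arrives and then accept it
    client, addr = server.accept()
    # ... handle client ...
    client.close()
```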
The examples mentioned in this blog post are reproducible in the lab from theojulienne/blog-lab-scaling-accept - clone this repository, bring up the lab, then poke around at these examples on a real system:
```
$ git clone https://github.com/theojulienne/blog-lab-scaling-accept.git
$ cd blog-lab-scaling-accept
$ vagrant up
$ vagrant ssh
```
From SYN to accept()
The Linux kernel maintains 2 queues that hold the backlog of connections not yet `accept()`ed by the application:
When a SYN packet is received to initiate a new connection to a listen socket, a SYN-ACK is sent to the client and the half-completed connection state is stored in the “SYN backlog” or “Request Socket Queue”. This represents connections that have not yet been fully validated as having two-way communication between the hosts, because we haven’t yet validated that the remote end has received a packet from us (the SYN could have come from another host spoofing the source IP).
Once the client responds to the server's SYN-ACK with an ACK, the connection has completed the full TCP 3-way handshake, and the server knows that two-way communication has been established. At this point, the client connection is ready to be provided to the application at a future call to `accept()`, and is added to the appropriately named "accept queue".
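One consequence worth internalising is that the 3-way handshake completes entirely in the kernel: a client's connect() can succeed while the application has not yet called accept(), as long as there is room in the accept queue. A minimal sketch of this (the port is hypothetical and this is not one of the lab scripts):

```python
import socket

# Server: listen, but deliberately never call accept().
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 9000))  # hypothetical port for illustration
srv.listen(5)

# Client: connect() returns successfully because the kernel completed the
# handshake and parked the connection in the listen socket's accept queue.
cli = socket.create_connection(("127.0.0.1", 9000))
print("connected without the server ever calling accept()")
```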
SYN backlog
Connections in the SYN backlog remain there for a period of time relative to the Round Trip Time between the server and the client. If there are N slots in the backlog then you can have at most N connections in this backlog per average RTT, after which the backlog overflows.
This won't actually cause the connections to fail by default on Linux, but will instead cause SYN cookies to be sent. This is because, when only a SYN packet has been received, the server hasn't yet validated that the client is who they say they are, so they could be spoofing packets as coming from a different IP. This is a common Denial of Service attack called a SYN Flood. Because this is so common, the Linux kernel has built-in mitigation for SYN floods: it sends a SYN cookie when there is no room in the SYN backlog for the new connection.

This means that even if the SYN backlog overflows under normal circumstances with no DoS attack, the kernel allows the connection to move further along in the handshake, and won't store any state for the connection until an ACK completes the handshake. Sending SYN cookies under normal circumstances does, however, indicate that the rate of connections is probably too high for the default limits - SYN cookies are really only meant for mitigating SYN floods.
Whether SYN cookies are enabled is controlled via the sysctl `net.ipv4.tcp_syncookies`. By default this is set to `1`, which indicates that SYN cookies should be sent when the SYN backlog overflows; it can also be set to `0` to disable them entirely, or `2` to force SYN cookies to be sent 100% of the time.
When SYN cookies are enabled as needed (the default), they are triggered when the number of pending connections in the SYN backlog is more than the configured accept queue backlog size for the socket - the logic for this is here. When SYN cookies are fully disabled, `net.ipv4.tcp_max_syn_backlog` configures the number of connections allowed in the SYN backlog separately - the logic for this case is here.
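If you want to check these knobs programmatically rather than with the sysctl command, they are exposed under /proc/sys; a small sketch (the dots-to-slashes mapping is just the standard procfs layout):

```python
def read_sysctl(name):
    # sysctl names map to /proc/sys paths with dots replaced by slashes
    with open("/proc/sys/" + name.replace(".", "/")) as f:
        return f.read().strip()

print("tcp_syncookies      =", read_sysctl("net.ipv4.tcp_syncookies"))
print("tcp_max_syn_backlog =", read_sysctl("net.ipv4.tcp_max_syn_backlog"))
print("somaxconn           =", read_sysctl("net.core.somaxconn"))
```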
In the default configuration, where the backlog overflows and SYN cookies are sent, the kernel increments the `TCPReqQFullDoCookies` counter and logs the following line to the kernel log, which is often mistaken for an indicator of a real SYN flood even when it's just legitimate connections coming in too fast:
TCP: request_sock_TCP: Possible SYN flooding on port 8080. Sending cookies. Check SNMP counters.
If SYN cookies are explicitly disabled, the kernel increments the `TCPReqQFullDrop` counter and the following would be logged instead:
TCP: request_sock_TCP: Possible SYN flooding on port 8080. Dropping request. Check SNMP counters.
If you see this message for internal service-to-service connections, chances are you’re dealing with a scaling problem or a thundering herd problem , and not an actual SYN flood or intentional bad actor.
Those "SNMP counters" that the kernel mentions are available via `nstat` in the network namespace:
```
$ nstat -a | grep TCPReqQFull
TcpExtTCPReqQFullDoCookies      706        0.0
```
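`nstat` is essentially reading the TcpExt counters out of /proc/net/netstat, so for monitoring they can also be scraped directly; a rough sketch of parsing that file (counter names get prefixed with their group, matching what nstat prints):

```python
def read_tcpext_counters(path="/proc/net/netstat"):
    counters = {}
    with open(path) as f:
        lines = f.read().splitlines()
    # The file comes in pairs of lines: a header line of counter names
    # followed by a line of values, each prefixed with the group name.
    for header, values in zip(lines[::2], lines[1::2]):
        group, names = header.split(":", 1)
        _, nums = values.split(":", 1)
        for name, num in zip(names.split(), nums.split()):
            counters[group + name] = int(num)
    return counters

c = read_tcpext_counters()
for key in ("TcpExtTCPReqQFullDoCookies", "TcpExtTCPReqQFullDrop", "TcpExtListenOverflows"):
    print(key, c.get(key, 0))
```

Note that these are the raw absolute counters; `nstat` itself keeps history and prints deltas by default.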
SYN backlog in the lab
To see how the SYN backlog behaves in the lab, we can simulate latency on local connections with the following:
```
vagrant@blog-lab-scaling-accept:~$ sudo /vagrant/reset-lab.sh
vagrant@blog-lab-scaling-accept:~$ sudo tc qdisc add dev lo root netem delay 200ms
vagrant@blog-lab-scaling-accept:~$ ping 127.0.0.1
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=401 ms
64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=400 ms
```
This means we can now simulate overflowing the SYN backlog. To make it a bit easier to reproduce, reduce the size of the SYN backlog from the default of `128` by reducing `net.core.somaxconn` (because SYN cookies are enabled, that is the limit that applies):
```
vagrant@blog-lab-scaling-accept:~$ sudo sysctl net.core.somaxconn
net.core.somaxconn = 128
vagrant@blog-lab-scaling-accept:~$ sudo sysctl -w net.core.somaxconn=10
net.core.somaxconn = 10
vagrant@blog-lab-scaling-accept:~$ sudo systemctl restart nginx
vagrant@blog-lab-scaling-accept:~$
```
At this point, if more than 10 connections arrive within 400ms (before the SYN-ACK and ACK complete the handshake), then 10 connections will be in the SYN backlog, which is the configured maximum. This will cause SYN cookies to be sent, triggering the message and counters above. To test whether this works, run a simulation that opens N concurrent connections:
```
vagrant@blog-lab-scaling-accept:~$ python /vagrant/test_send_concurrent_connections.py 127.0.0.1:80 20
Waiting, Ctrl+C to exit.
```
And verify that SYN cookies were sent as expected:
```
vagrant@blog-lab-scaling-accept:~$ sudo dmesg -c
[ 2571.784749] TCP: request_sock_TCP: Possible SYN flooding on port 80. Sending cookies. Check SNMP counters.
vagrant@blog-lab-scaling-accept:~$ sudo nstat | grep ReqQ
TcpExtTCPReqQFullDoCookies      10         0.0
vagrant@blog-lab-scaling-accept:~$
```
Try disabling the simulated latency with `sudo tc qdisc del dev lo root` and re-running the same test: now connections move through the handshake fast enough that SYN cookies are not sent.
This also demonstrates how the Round Trip Time affects the number of connections that can burst into a LISTEN socket - `test_send_concurrent_connections.py` does attempt to initiate complete connections, and `nginx` on the other end is readily calling `accept()` as fast as it can, but because 20 connections are initiated at once, 20 SYN packets arrive before any handshake can continue, and the SYN backlog overflows. You can imagine that in the real world, with some of these values often defaulting to `128` and users often being far away from servers (on the other side of the world), it's pretty easy to trigger this scenario accidentally without a true SYN flood.
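As a rough back-of-envelope model of that relationship, a SYN backlog of N slots can absorb roughly N handshakes in flight per RTT, so the sustainable new-connection rate before cookies kick in is about N / RTT. A small sketch with illustrative numbers only:

```python
# Illustrative only: estimate the new-connection rate a SYN backlog can absorb.
backlog_slots = 128      # effective SYN backlog (here capped by somaxconn)
rtt_seconds = 0.300      # client <-> server round trip time

sustainable_rate = backlog_slots / rtt_seconds
print("~%.0f new connections/sec before SYN cookies kick in" % sustainable_rate)
# With a 300ms RTT that's only ~427/sec; the same backlog with a 1ms RTT
# absorbs ~128,000/sec - RTT matters as much as connection rate.
```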
Accept queue
Once an ACK packet is received and validated, a new client connection is ready for the application to process. This connection is moved into the accept queue, waiting for the application to call `accept()` and receive it.
Unlike the SYN backlog, the accept queue has no backup plan for when it overflows. Back in the original call to `listen()`, a `backlog` argument was provided, which indicates how many connections can have completed the 3-way handshake and be waiting in the kernel for the application to accept.
This is the first common issue: if the `backlog` number provided to `listen()` is not large enough to hold the number of connections that could reasonably complete their handshake between 2 calls to `accept()`, then a connection will be dropped on the floor, and the application generally won't even notice - the next call to `accept()` will succeed without any indication of a dropped connection! This is particularly likely to happen when a reasonable amount of work can happen between calls to `accept()`, or when the incoming connections tend to arrive at the same time (such as jobs running on many servers with cron, or a thundering herd of reconnection attempts).
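The backlog is easy to undersize simply because many libraries choose a small default for you. As one illustrative example, Python's standard socketserver.TCPServer calls listen() with its request_queue_size attribute, which defaults to 5 unless overridden - a brief burst can easily exceed that:

```python
import socketserver

class Handler(socketserver.BaseRequestHandler):
    def handle(self):
        self.request.sendall(b"hello\n")

class TunedTCPServer(socketserver.TCPServer):
    # The default is request_queue_size = 5, i.e. listen(5); raise it so a
    # burst of connections arriving between accept() calls isn't silently
    # dropped (still subject to net.core.somaxconn, covered next).
    request_queue_size = 1024

if __name__ == "__main__":
    # hypothetical port, for illustration only
    with TunedTCPServer(("127.0.0.1", 8081), Handler) as server:
        server.serve_forever()
```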
However, even when you specify a high enough `backlog` value to `listen()`, there's another location that silently limits this value. The `net.core.somaxconn` sysctl specifies a system-wide maximum for the `backlog` of any socket. When a larger `backlog` value is provided, the kernel silently caps it at the value of `net.core.somaxconn`. This is the next most common issue: both `net.core.somaxconn` and the `backlog` argument to `listen()` need to be adjusted appropriately so that the backlog is actually set as expected.
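To check what limit actually took effect for a running listener, `ss -ltn` is handy: for sockets in the LISTEN state, recent versions report the current accept queue depth in the Recv-Q column and the configured backlog (after any capping) in Send-Q. A small sketch wrapping that, assuming iproute2's ss is installed:

```python
import subprocess

def listener_backlogs():
    # `ss -ltn`: listening TCP sockets, numeric output; for LISTEN sockets the
    # Recv-Q column is the current accept queue depth and Send-Q is the limit.
    out = subprocess.run(["ss", "-ltn"], capture_output=True, text=True, check=True).stdout
    for line in out.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 4:
            recvq, sendq, local = int(fields[1]), int(fields[2]), fields[3]
            yield local, recvq, sendq

for local, queued, limit in listener_backlogs():
    print("%-22s accept queue %d / %d" % (local, queued, limit))
```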
As an added complexity, the `net.core.somaxconn` sysctl is not system global, but is global to a Linux network namespace. New network namespaces, such as those used by most Docker containers, do not inherit the values from the default network namespace; instead they are set to the built-in kernel default:
```
vagrant@blog-lab-scaling-accept:~$ sudo /vagrant/reset-lab.sh
vagrant@blog-lab-scaling-accept:~$ sudo sysctl net.core.somaxconn
net.core.somaxconn = 128
vagrant@blog-lab-scaling-accept:~$ sudo docker run -it ubuntu sysctl net.core.somaxconn
net.core.somaxconn = 128
vagrant@blog-lab-scaling-accept:~$ sudo sysctl -w net.core.somaxconn=1024
net.core.somaxconn = 1024
vagrant@blog-lab-scaling-accept:~$ sudo docker run -it ubuntu sysctl net.core.somaxconn
net.core.somaxconn = 128
vagrant@blog-lab-scaling-accept:~$
```
Therein lies the third common issue: when running in containers that have their own network namespace (which is most containers launched by Kubernetes/Docker), even if the host system has `net.core.somaxconn` correctly tweaked, that value will be ignored, so the container must also have `net.core.somaxconn` tweaked to match the application running inside it.
If any of these situations occurs and causes impact, it is visible in the `ListenOverflows` counter:
```
$ nstat -a | grep ListenOverflows
TcpExtListenOverflows           6811       0.0
```
Accept queue in the lab
To see how the accept queue behaves in the lab, we can run a server that is bad at accepting connections (it sleeps a lot) with a backlog of 10:
```
vagrant@blog-lab-scaling-accept:~$ sudo /vagrant/reset-lab.sh
vagrant@blog-lab-scaling-accept:~$ sudo sysctl -w net.core.somaxconn=1024
net.core.somaxconn = 1024
vagrant@blog-lab-scaling-accept:~$ python /vagrant/laggy_server.py 10
Listening with backlog 10
```
In another window, send a bunch of connections to the laggy server, then exit after a few seconds:
```
vagrant@blog-lab-scaling-accept:~$ python /vagrant/test_send_concurrent_connections.py 127.0.0.1:8080 20
Waiting, Ctrl+C to exit.
(wait a few seconds)
^C
vagrant@blog-lab-scaling-accept:~$ sudo nstat | grep ListenOverflows
TcpExtListenOverflows           44         0.0
vagrant@blog-lab-scaling-accept:~$
```
At this point, we can see how important the backlog argument to `listen()` is. Next, let's review how `somaxconn` caps this by setting up a server with a larger backlog that we expect to be capped:
```
vagrant@blog-lab-scaling-accept:~$ sudo /vagrant/reset-lab.sh
vagrant@blog-lab-scaling-accept:~$ sudo sysctl -w net.core.somaxconn=10
net.core.somaxconn = 10
vagrant@blog-lab-scaling-accept:~$ python /vagrant/laggy_server.py 1024
Listening with backlog 1024
```
All looks good - notice how we effectively called `listen(1024)` and nothing went wrong. Running the same commands as we did when we provided the small `backlog` value, we can observe the same problem, because the backlog is silently truncated by the kernel:
```
vagrant@blog-lab-scaling-accept:~$ python /vagrant/test_send_concurrent_connections.py 127.0.0.1:8080 20
Waiting, Ctrl+C to exit.
(wait a few seconds)
^C
vagrant@blog-lab-scaling-accept:~$ sudo nstat | grep ListenO
TcpExtListenOverflows           77         0.0
vagrant@blog-lab-scaling-accept:~$
```
Monitoring counters
A few counters were mentioned above that can be read with `nstat`:

- TcpExtTCPReqQFullDoCookies
- TcpExtTCPReqQFullDrop
- TcpExtListenOverflows
Because these counters are all part of the network namespace and are global within a namespace, no information about which socket or application caused the issue will be visible, and if the process is in a container, it will not be visible in the base system counters (in the default network namespace). Because of this, every network namespace will need to be inspected/monitored for these counters, and then additional work needs to be done to trace this back to the application, potentially using the kernel log lines as a hint in the case of the SYN backlog variant.
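One pragmatic way to get that per-namespace visibility is to read the counters through /proc/&lt;pid&gt;/net/netstat, which reflects the network namespace of the given pid, sampling one pid per unique namespace. A rough sketch (needs root to see other users' namespaces; error handling kept minimal):

```python
import os

def netns_id(pid):
    # /proc/<pid>/ns/net is a symlink like "net:[4026531992]" naming the namespace
    try:
        return os.readlink("/proc/%s/ns/net" % pid)
    except OSError:
        return None

def tcpext_counters(pid):
    # /proc/<pid>/net/netstat shows the counters of that pid's network namespace
    counters = {}
    with open("/proc/%s/net/netstat" % pid) as f:
        lines = f.read().splitlines()
    for header, values in zip(lines[::2], lines[1::2]):
        group, names = header.split(":", 1)
        _, nums = values.split(":", 1)
        counters.update(zip((group + n for n in names.split()), map(int, nums.split())))
    return counters

# pick one representative pid per network namespace
seen = {}
for pid in filter(str.isdigit, os.listdir("/proc")):
    ns = netns_id(pid)
    if ns and ns not in seen:
        seen[ns] = pid

for ns, pid in seen.items():
    try:
        c = tcpext_counters(pid)
    except OSError:
        continue  # the process exited while we were iterating
    print(ns, "ListenOverflows:", c.get("TcpExtListenOverflows", 0))
```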
Detecting backlog overflows with tracing
The metrics available from the counters above provide a minimal picture of backlog overflows and dropped connections, but ideally we would be able to inspect this situation system-wide, seeing through any network namespaces, and be able to link it back to an application and even socket/port.
This is possible with kprobes and eBPF tracing via bcc, by hooking the kernel functions that handle these failure cases and inspecting the context at that point in time. This allows us to extract realtime data from production systems, unlike the counters, which are at best vague indicators that an underlying issue exists somewhere on the system.
We know that the `ListenOverflows` counter is incremented any time the accept backlog overflows - we can start there and work back to building a tracing program.
Searching the Linux source code for ListenOverflows shows what this counter is called internally - `LINUX_MIB_LISTENOVERFLOWS`. Searching the kernel source tree shows all the places in the kernel that increment that counter - in this case, the best candidates are `tcp_conn_request` in `net/ipv4/tcp_input.c` and `tcp_v4_syn_recv_sock` in `net/ipv4/tcp_ipv4.c`. They handle slightly different points in the connection lifecycle:
[Diagram in the original post: where tcp_conn_request and tcp_v4_syn_recv_sock sit in the connection lifecycle]

These map to the earlier diagram:
- `tcp_conn_request` drops connections if the accept queue is full, as a safety mechanism, even though it wouldn't be adding to the queue yet.
- `tcp_v4_syn_recv_sock` also drops connections if the accept queue is full, and rightly so, since it would be adding to it.

If a SYN packet was accepted and added to the SYN backlog while the accept queue still had space, but the queue is full by the time the ACK arrives, the drop occurs in `tcp_v4_syn_recv_sock` when the ACK is received. If a new SYN arrives while the accept queue is already full, then `tcp_conn_request` drops it instead. For local testing in a VM, `tcp_conn_request` is the easier one to exercise, since it doesn't require careful timing between the SYN and ACK to reproduce.
The code path that increments the `ListenOverflows` counter in each of those functions looks like the following:
```c
int tcp_conn_request(struct request_sock_ops *rsk_ops,
                     const struct tcp_request_sock_ops *af_ops,
                     struct sock *sk, struct sk_buff *skb)
{
    /* ... */
    if (sk_acceptq_is_full(sk)) {
        NET_INC_STATS(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
        goto drop;
    }
    /* ... */
drop:
    tcp_listendrop(sk);
    return 0;
}

struct sock *tcp_v4_syn_recv_sock(const struct sock *sk, struct sk_buff *skb,
                                  struct request_sock *req,
                                  struct dst_entry *dst,
                                  struct request_sock *req_unhash,
                                  bool *own_req)
{
    /* ... */
    if (sk_acceptq_is_full(sk))
        goto exit_overflow;
    /* ... */
exit_overflow:
    NET_INC_STATS(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
    /* ... */
}
```
The function called to check for overflow is a simple inlined check:
```c
static inline bool sk_acceptq_is_full(const struct sock *sk)
{
    return READ_ONCE(sk->sk_ack_backlog) > READ_ONCE(sk->sk_max_ack_backlog);
}
```
When looking to trace functions that are inlined like this, the easiest way to hook the condition is to attach to the start of the calling function (in this case `tcp_conn_request` and `tcp_v4_syn_recv_sock`), which is not inlined, and perform the same conditional check in our eBPF code. We can do this with kprobes:
```c
/* generic handler */
static inline void handle_sk_potential_overflow(struct pt_regs *ctx, int syn_recv,
                                                const struct sock *sk)
{
    /* we need to read these using bpf_probe_read to ensure the read is safe */
    u32 sk_ack_backlog = 0;
    u32 sk_max_ack_backlog = 0;
    bpf_probe_read(&sk_ack_backlog, sizeof(sk_ack_backlog),
                   (void *)&sk->sk_ack_backlog);
    bpf_probe_read(&sk_max_ack_backlog, sizeof(sk_max_ack_backlog),
                   (void *)&sk->sk_max_ack_backlog);

    if (sk_ack_backlog > sk_max_ack_backlog) {
        /* handle the condition */
    }
}

/* when a SYN arrives */
struct tcp_request_sock_ops;
void kprobe__tcp_conn_request(struct pt_regs *ctx, struct request_sock_ops *rsk_ops,
                              const struct tcp_request_sock_ops *af_ops,
                              const struct sock *sk, struct sk_buff *skb)
{
    handle_sk_potential_overflow(ctx, 0, sk);
}

/* when an ACK arrives for a SYN_RECV socket */
void kprobe__tcp_v4_syn_recv_sock(struct pt_regs *ctx, const struct sock *sk)
    /* no need for the remaining unused args */
{
    handle_sk_potential_overflow(ctx, 1, sk);
}
```
At this point, we have a hook on the same condition that triggers incrementing the `ListenOverflows` counter in TCP over IPv4. This can be fleshed out to read detailed information about the socket and return it to the userspace tracing program.
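As a sketch of what that fleshing out might look like, the userspace side of a bcc tool could attach the kprobes above and receive events through a hypothetical BPF_PERF_OUTPUT table named `events`. The lab's trace_backlog_overflow.py goes further and resolves the container, network namespace and bind address; this minimal version only reports the queue depths:

```python
#!/usr/bin/env python3
# Hypothetical minimal loader; assumes the bcc Python bindings are installed.
import ctypes
from bcc import BPF

bpf_text = r"""
#include <net/sock.h>

struct event_t {
    u32 pid;          /* task running when the drop happened (hint only) */
    u32 backlog;      /* current accept queue depth */
    u32 max_backlog;  /* configured limit (listen() backlog after capping) */
    u32 syn_recv;     /* 0 = dropped on SYN, 1 = dropped on ACK */
};
BPF_PERF_OUTPUT(events);

static inline void handle_sk_potential_overflow(struct pt_regs *ctx, int syn_recv,
                                                const struct sock *sk)
{
    u32 sk_ack_backlog = 0, sk_max_ack_backlog = 0;
    bpf_probe_read(&sk_ack_backlog, sizeof(sk_ack_backlog), (void *)&sk->sk_ack_backlog);
    bpf_probe_read(&sk_max_ack_backlog, sizeof(sk_max_ack_backlog), (void *)&sk->sk_max_ack_backlog);
    if (sk_ack_backlog > sk_max_ack_backlog) {
        struct event_t ev = {};
        /* in the receive path this is whatever task was interrupted, not
           necessarily the listening process - the lab tool digs deeper */
        ev.pid = bpf_get_current_pid_tgid() >> 32;
        ev.backlog = sk_ack_backlog;
        ev.max_backlog = sk_max_ack_backlog;
        ev.syn_recv = syn_recv;
        events.perf_submit(ctx, &ev, sizeof(ev));
    }
}

void kprobe__tcp_conn_request(struct pt_regs *ctx, void *rsk_ops, void *af_ops,
                              const struct sock *sk, void *skb)
{
    handle_sk_potential_overflow(ctx, 0, sk);
}

void kprobe__tcp_v4_syn_recv_sock(struct pt_regs *ctx, const struct sock *sk)
{
    handle_sk_potential_overflow(ctx, 1, sk);
}
"""

class Event(ctypes.Structure):
    _fields_ = [("pid", ctypes.c_uint32), ("backlog", ctypes.c_uint32),
                ("max_backlog", ctypes.c_uint32), ("syn_recv", ctypes.c_uint32)]

def print_event(cpu, data, size):
    ev = ctypes.cast(data, ctypes.POINTER(Event)).contents
    print("pid=%d pkt=%s backlog=%d max=%d"
          % (ev.pid, "ACK" if ev.syn_recv else "SYN", ev.backlog, ev.max_backlog))

b = BPF(text=bpf_text)  # kprobe__* functions auto-attach to the matching kernel symbols
b["events"].open_perf_buffer(print_event)
while True:
    b.perf_buffer_poll()
```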
Tracing backlog overflows in the lab
One such program is included in the lab; let's start up our laggy server again:
```
vagrant@blog-lab-scaling-accept:~$ sudo /vagrant/reset-lab.sh
vagrant@blog-lab-scaling-accept:~$ python /vagrant/laggy_server.py 10
Listening with backlog 10
```
In another session, start up the tracing program:
```
vagrant@blog-lab-scaling-accept:~$ sudo python /vagrant/trace_backlog_overflow.py
CONTAINER/HOST       PID    PROCESS              NETNSID     BIND            PKT  BL  MAX
```
And finally, run our client:
```
vagrant@blog-lab-scaling-accept:~$ python /vagrant/test_send_concurrent_connections.py 127.0.0.1:8080 20
Waiting, Ctrl+C to exit.
```
The result will be traces showing where the backlog overflowed the configured maximum, along with details of the process and whether it occurred during SYN or ACK:
```
vagrant@blog-lab-scaling-accept:~$ sudo python /vagrant/trace_backlog_overflow.py
CONTAINER/HOST       PID    PROCESS              NETNSID     BIND            PKT  BL  MAX
blog-lab-scaling-acc 14742  python /vagrant/lagg 4026531992  127.0.0.1:8080  SYN  11  10
blog-lab-scaling-acc 14742  python /vagrant/lagg 4026531992  127.0.0.1:8080  SYN  11  10
blog-lab-scaling-acc 14742  python /vagrant/lagg 4026531992  127.0.0.1:8080  SYN  11  10
blog-lab-scaling-acc 14742  python /vagrant/lagg 4026531992  127.0.0.1:8080  SYN  11  10
blog-lab-scaling-acc 14742  python /vagrant/lagg 4026531992  127.0.0.1:8080  SYN  11  10
blog-lab-scaling-acc 14742  python /vagrant/lagg 4026531992  127.0.0.1:8080  SYN  11  10
blog-lab-scaling-acc 14742  python /vagrant/lagg 4026531992  127.0.0.1:8080  SYN  11  10
blog-lab-scaling-acc 14742  python /vagrant/lagg 4026531992  127.0.0.1:8080  SYN  11  10
blog-lab-scaling-acc 14742  python /vagrant/lagg 4026531992  127.0.0.1:8080  SYN  11  10
blog-lab-scaling-acc 14742  python /vagrant/lagg 4026531992  127.0.0.1:8080  SYN  11  10
blog-lab-scaling-acc 14742  python /vagrant/lagg 4026531992  127.0.0.1:8080  SYN  11  10
blog-lab-scaling-acc 14742  python /vagrant/lagg 4026531992  127.0.0.1:8080  SYN  11  10
blog-lab-scaling-acc 14742  python /vagrant/lagg 4026531992  127.0.0.1:8080  SYN  11  10
blog-lab-scaling-acc 14742  python /vagrant/lagg 4026531992  127.0.0.1:8080  SYN  11  10
blog-lab-scaling-acc 14742  python /vagrant/lagg 4026531992  127.0.0.1:8080  SYN  11  10
blog-lab-scaling-acc 14742  python /vagrant/lagg 4026531992  127.0.0.1:8080  SYN  11  10
```
As discussed earlier, if many SYN packets are queued while there is room in the accept queue, but then the queue overflows anyway, the ACK packets trigger it instead (run the client again in another window to see this output):
```
vagrant@blog-lab-scaling-accept:~$ sudo tc qdisc add dev lo root netem delay 200ms
vagrant@blog-lab-scaling-accept:~$ sudo python /vagrant/trace_backlog_overflow.py
CONTAINER/HOST       PID    PROCESS              NETNSID     BIND            PKT  BL  MAX
blog-lab-scaling-acc 14742  python /vagrant/lagg 4026531992  127.0.0.1:8080  ACK  11  10
blog-lab-scaling-acc 14742  python /vagrant/lagg 4026531992  127.0.0.1:8080  ACK  11  10
blog-lab-scaling-acc 14742  python /vagrant/lagg 4026531992  127.0.0.1:8080  ACK  11  10
blog-lab-scaling-acc 14742  python /vagrant/lagg 4026531992  127.0.0.1:8080  ACK  11  10
blog-lab-scaling-acc 14742  python /vagrant/lagg 4026531992  127.0.0.1:8080  ACK  11  10
blog-lab-scaling-acc 14742  python /vagrant/lagg 4026531992  127.0.0.1:8080  ACK  11  10
blog-lab-scaling-acc 14742  python /vagrant/lagg 4026531992  127.0.0.1:8080  ACK  11  10
blog-lab-scaling-acc 14742  python /vagrant/lagg 4026531992  127.0.0.1:8080  ACK  11  10
blog-lab-scaling-acc 14742  python /vagrant/lagg 4026531992  127.0.0.1:8080  ACK  11  10
```
Detecting misconfigured listen sockets with tracing
We can also trace calls to `listen()` to observe when the system caps the backlog size, giving a leading indicator for applications launching on the machine with misconfigured values for the backlog or `somaxconn`. This will only work at the time that applications start up, since we'll be hooking `listen()` as it is actually called, which is typically just when the application first launches.
The easiest way to find the kernel source responsible for this is to start from the `listen()` syscall, which is `__sys_listen()` in the kernel:
```c
int __sys_listen(int fd, int backlog)
{
    struct socket *sock;
    int err, fput_needed;
    int somaxconn;

    sock = sockfd_lookup_light(fd, &err, &fput_needed);
    if (sock) {
        somaxconn = sock_net(sock->sk)->core.sysctl_somaxconn;
        if ((unsigned int)backlog > somaxconn)
            backlog = somaxconn;

        err = security_socket_listen(sock, backlog);
        if (!err)
            err = sock->ops->listen(sock, backlog);

        fput_light(sock->file, fput_needed);
    }
    return err;
}
```
From this snippet alone, it's easy to observe how somaxconn (`sysctl_somaxconn`) from the current network namespace (`sock_net(sock->sk)`) silently limits the backlog. By tracing the `listen` syscall itself, we'll get the originally requested `backlog`, but to see what the limit was eventually set to we need to hook a point in time after that limit is applied (we could also read the sysctl ourselves and repeat the logic).
Conveniently, `sock->ops->listen(...)` calls the underlying listen operation for the address family, in this case `inet_listen`:
```c
int inet_listen(struct socket *sock, int backlog)
{
    struct sock *sk = sock->sk;
    /* ... */
}
```
At this point, `inet_listen` is called with the capped backlog. By combining a trace on the `listen()` syscall itself with a trace on `inet_listen`, we can map the originally requested value to the actual value used:
```c
BPF_HASH(requested_backlogs, u32, u64);

TRACEPOINT_PROBE(syscalls, sys_enter_listen) {
    u32 pid = bpf_get_current_pid_tgid();
    u64 backlog = args->backlog;
    requested_backlogs.update(&pid, &backlog);
    return 0;
}

void kprobe__inet_listen(struct pt_regs *ctx, struct socket *sock, int backlog) {
    struct sock *sk = sock->sk;

    u32 pid = bpf_get_current_pid_tgid();
    u64 *requested_backlog;
    requested_backlog = requested_backlogs.lookup(&pid);
    if (requested_backlog == NULL) {
        return;
    }
    requested_backlogs.delete(&pid);

    /*
      `*requested_backlog` contains the argument to listen()
      `backlog` contains the capped backlog
    */
}
```
With some extra work we can also peek inside the network namespace to see the value of `somaxconn` at the time of the call. From above, we could use `sock_net(sock->sk)->core.sysctl_somaxconn` if we have a `struct socket *sock` - or, since we now have the internal `struct sock *sk`, we'll want `sock_net(sk)->core.sysctl_somaxconn`. The `sock_net` function is pretty simple:
```c
static inline struct net *sock_net(const struct sock *sk)
{
    return read_pnet(&sk->sk_net);
}

/* ... elsewhere */

static inline struct net *read_pnet(const possible_net_t *pnet)
{
#ifdef CONFIG_NET_NS
    return pnet->net;
#else
    return &init_net;
#endif
}
```
We'll need to simulate that function chain with a call to `bpf_probe_read` to ensure it's safe to read that member, otherwise the eBPF verifier will complain, because the dereferencing is a bit too complex for `bcc` to automatically safeguard with a `bpf_probe_read` call. That gives us the following:
```c
static inline struct net *safe_sock_net(const struct sock *sk) {
#ifdef CONFIG_NET_NS
    struct net *ret;
    // was: read_pnet(&sk->sk_net)->net
    bpf_probe_read(&ret, sizeof(ret), (void *)&(sk->sk_net.net));
    return ret;
#else
    return &init_net;
#endif
}

void kprobe__inet_listen(struct pt_regs *ctx, struct socket *sock, int backlog) {
    struct sock *sk = sock->sk;

    /* ... */

    u32 netns_somaxconn = 0;
    bpf_probe_read(&netns_somaxconn, sizeof(netns_somaxconn),
                   (void *)&(safe_sock_net(sk)->core.sysctl_somaxconn));

    /* ... */
}
```
At this point we have the requested backlog (the argument to `listen()`), the actual backlog calculated (after any somaxconn capping), and the value of `somaxconn` as well. We can fill this out with further information about the pid and namespace and provide it to the userspace side of the tracing app.
Tracing capped listen backlogs in the lab
A fleshed-out tracing app for the above is also included in the lab; this time we need to start the tracing app before our `listen()`ing server:
```
vagrant@blog-lab-scaling-accept:~$ sudo /vagrant/reset-lab.sh
vagrant@blog-lab-scaling-accept:~$ sudo python /vagrant/trace_listen_backlog_capped.py
CONTAINER/HOST       PID    PROCESS              NETNSID     BIND            REQ  MAX  ACTUAL
```
Now we can start up our server with a small backlog, then with a larger backlog, and then with a backlog limited by `somaxconn`:
```
vagrant@blog-lab-scaling-accept:~$ python /vagrant/laggy_server.py 10
Listening with backlog 10
^C
vagrant@blog-lab-scaling-accept:~$ python /vagrant/laggy_server.py 128
Listening with backlog 128
^C
vagrant@blog-lab-scaling-accept:~$ python /vagrant/laggy_server.py 1024
Listening with backlog 1024
^C
```
The tracing app will show all these cases, along with where it was silently limited:
```
vagrant@blog-lab-scaling-accept:~$ sudo python /vagrant/trace_listen_backlog_capped.py
CONTAINER/HOST       PID    PROCESS              NETNSID     BIND            REQ  MAX  ACTUAL
blog-lab-scaling-acc 14849  python /vagrant/lagg 4026531992  127.0.0.1:8080  10   128  10
blog-lab-scaling-acc 14851  python /vagrant/lagg 4026531992  127.0.0.1:8080  128  128  128
blog-lab-scaling-acc 14853  python /vagrant/lagg 4026531992  127.0.0.1:8080  1024 128  128 LIMITED
```
This also works through container network namespaces:
```
vagrant@blog-lab-scaling-accept:~$ sudo docker run -v /vagrant:/vagrant -it python:2.7 python /vagrant/laggy_server.py 1024
Listening with backlog 1024
^C
vagrant@blog-lab-scaling-accept:~$ sudo sysctl -w net.core.somaxconn=1024
net.core.somaxconn = 1024
vagrant@blog-lab-scaling-accept:~$ sudo docker run -v /vagrant:/vagrant -it python:2.7 python /vagrant/laggy_server.py 1024
Listening with backlog 1024
^C
vagrant@blog-lab-scaling-accept:~$ sudo docker run --sysctl net.core.somaxconn=1024 -v /vagrant:/vagrant -it python:2.7 python /vagrant/laggy_server.py 1024
Listening with backlog 1024
^C
vagrant@blog-lab-scaling-accept:~$
```
Note that setting the `somaxconn` value on the base system had no effect, but setting the value in the container with `docker run --sysctl` did change it:
```
vagrant@blog-lab-scaling-accept:~$ sudo python /vagrant/trace_listen_backlog_capped.py
CONTAINER/HOST       PID    PROCESS              NETNSID     BIND            REQ  MAX  ACTUAL
380944436fe3         15274  python /vagrant/lagg 4026532171  127.0.0.1:8080  1024 128  128 LIMITED
9e82132ad3d6         15387  python /vagrant/lagg 4026532171  127.0.0.1:8080  1024 128  128 LIMITED
af3b584488ea         15495  python /vagrant/lagg 4026532171  127.0.0.1:8080  1024 1024 1024
```
Wrapping up
We’ve seen how scaling something as simple as a TCP service can run into unexpected resource limitations, even before the program accepts a connection. Key takeaways are:
- The `backlog` argument to `listen` should always be set high enough that any valid burst of connections that could arrive between calls to `accept()` will be queued up
- `net.core.somaxconn` must be set high enough that the `backlog` value is not restricted - if not, it will be silently capped (see the sketch after this list)
- `net.core.somaxconn` is namespaced, so it must be set in the namespace/container that the socket is created in
- SYN flood warnings are often logged because the SYN backlog is too small, and should be treated as a potential scaling issue unless a real SYN flood is suspected
- The number of connections sitting in the SYN backlog depends on the RTT between the client and server, not just the rate of new connections
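As a concrete illustration of the first two takeaways, both knobs have to agree before the larger backlog takes effect. A minimal sketch (the port is arbitrary, and the sysctl write requires root and would normally be done through system configuration rather than application code):

```python
import socket

BACKLOG = 1024  # sized for the worst-case burst expected between accept() calls

# The kernel-side cap lives in the network namespace the socket is created in;
# writing it here is just for illustration - it is usually set via sysctl config.
with open("/proc/sys/net/core/somaxconn", "w") as f:
    f.write(str(BACKLOG))

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("0.0.0.0", 8080))
srv.listen(BACKLOG)  # silently capped to net.core.somaxconn if that is lower
```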
The tracing tools provided here are fairly lightweight in kernel space, are mostly gated to failure cases, and hook relatively low-rate system calls, making them reasonably safe to use on production systems, unless CPU resource usage is an active bottleneck at the time. These tracing tools may help track down which application or socket is triggering the `ListenOverflows` or `TCPReqQFull*` counters, as well as monitor applications as they launch for any misconfigurations, without needing to audit the configuration of each one individually.