* [PATCH bpf 0/4] bpf, sockmap: fixes stress testing and regression
@ 2021-10-11 19:16 John Fastabend
  2021-10-11 19:16 ` [PATCH bpf 1/4] bpf, sockmap: Remove unhash handler for BPF sockmap usage John Fastabend
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: John Fastabend @ 2021-10-11 19:16 UTC (permalink / raw)
  To: bpf, netdev; +Cc: john.fastabend, daniel, joamaki, xiyou.wangcong

Attached are 4 patches that fix issues we found by either stress testing
or updating our CI to LTS kernels. We nearly have CI running on BPF tree
now so hopefully future regressions will be caught much earlier.

Thanks to Jussi for all the hard work tracking down issues and getting
stress testing/CI running.

The first two patches fix issues discovered by Jussi after writing a
stress testing tool.

The last two fix an issue noticed while reviewing patches and xlated
code paths, also discovered by Jussi.


John Fastabend (3):
  bpf, sockmap: Remove unhash handler for BPF sockmap usage
  bpf, sockmap: Fix race in ingress receive verdict with redirect to
    self
  bpf: sockmap, strparser, and tls are reusing qdisc_skb_cb and
    colliding

Jussi Maki (1):
  bpf: sk skb data_end stomps register when src_reg = dst_reg

 include/net/strparser.h   | 20 +++++++++++++-
 net/core/filter.c         | 58 +++++++++++++++++++++++++++++++++++----
 net/ipv4/tcp_bpf.c        | 48 +++++++++++++++++++++++++++++++-
 net/strparser/strparser.c | 10 +------
 4 files changed, 119 insertions(+), 17 deletions(-)

-- 
2.33.0



* [PATCH bpf 1/4] bpf, sockmap: Remove unhash handler for BPF sockmap usage
  2021-10-11 19:16 [PATCH bpf 0/4] bpf, sockmap: fixes stress testing and regression John Fastabend
@ 2021-10-11 19:16 ` John Fastabend
  2021-10-19  7:17   ` Jakub Sitnicki
  2021-10-11 19:16 ` [PATCH bpf 2/4] bpf, sockmap: Fix race in ingress receive verdict with redirect to self John Fastabend
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 14+ messages in thread
From: John Fastabend @ 2021-10-11 19:16 UTC (permalink / raw)
  To: bpf, netdev; +Cc: john.fastabend, daniel, joamaki, xiyou.wangcong

We do not need to handle unhash from the BPF side; we can simply wait for
the close to happen. The original concern was that a socket could
transition from ESTABLISHED state to a new state while the BPF hook was
still attached. But we convinced ourselves this is no longer possible, and
we also improved BPF sockmap to handle listen sockets, so this is no
longer a problem.

More importantly, there are cases where unhash is called while data is
in the receive queue. The BPF unhash logic will flush this data, which is
wrong. To be correct it should keep the data in the receive queue and allow
a receiving application to continue reading the data. This may happen when
tcp_abort is received, for example. Instead of complicating the logic in
unhash, simply moving all this to the tcp_close hook solves this.

Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/ipv4/tcp_bpf.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index d3e9386b493e..35dcfb04f53d 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -476,7 +476,6 @@ static void tcp_bpf_rebuild_protos(struct proto prot[TCP_BPF_NUM_CFGS],
 				   struct proto *base)
 {
 	prot[TCP_BPF_BASE]			= *base;
-	prot[TCP_BPF_BASE].unhash		= sock_map_unhash;
 	prot[TCP_BPF_BASE].close		= sock_map_close;
 	prot[TCP_BPF_BASE].recvmsg		= tcp_bpf_recvmsg;
 	prot[TCP_BPF_BASE].stream_memory_read	= tcp_bpf_stream_read;
-- 
2.33.0



* [PATCH bpf 2/4] bpf, sockmap: Fix race in ingress receive verdict with redirect to self
  2021-10-11 19:16 [PATCH bpf 0/4] bpf, sockmap: fixes stress testing and regression John Fastabend
  2021-10-11 19:16 ` [PATCH bpf 1/4] bpf, sockmap: Remove unhash handler for BPF sockmap usage John Fastabend
@ 2021-10-11 19:16 ` John Fastabend
  2021-10-19  9:16   ` Jakub Sitnicki
  2021-10-11 19:16 ` [PATCH bpf 3/4] bpf: sockmap, strparser, and tls are reusing qdisc_skb_cb and colliding John Fastabend
  2021-10-11 19:16 ` [PATCH bpf 4/4] bpf, sockmap: sk_skb data_end access incorrect when src_reg = dst_reg John Fastabend
  3 siblings, 1 reply; 14+ messages in thread
From: John Fastabend @ 2021-10-11 19:16 UTC (permalink / raw)
  To: bpf, netdev; +Cc: john.fastabend, daniel, joamaki, xiyou.wangcong

A socket in a sockmap may have different combinations of programs
attached depending on configuration. There can be no programs, in which
case the socket acts as a sink only. There can be a TX program only, in
which case a BPF program is attached to the sending side but no RX
program is attached. There can be an RX program only, where sends have
no BPF program attached but receives are hooked with BPF. And finally,
both TX and RX programs may be attached. This gives us the permutations,

 None, Tx, Rx, and TxRx

To date most of our use cases have been either the TX case, used as a
fast datapath to copy directly between a local application and a
userspace proxy, or the Rx and TxRx cases, used by applications that
operate an in-kernel proxy. The traffic in the first case, where we hook
applications into a userspace application, looks like this,

  AppA  redirect   AppB
   Tx <-----------> Rx
   |                |
   +                +
   TCP <--> lo <--> TCP

In this case all traffic from AppA (after the 3-way handshake) is copied
into the AppB ingress queue and no traffic is ever on the TCP receive_queue.

In the second case the application never receives, except in some
rare error cases, traffic on the actual user space socket. Instead
the send happens in the kernel.

           AppProxy       socket pool
       sk0 ------------->{sk1,sk2, skn}
        ^                      |
        |                      |
        |                      v
       ingress              lb egress
       TCP                  TCP

Here because traffic is never read off the socket with userspace
recv() APIs there is only ever one reader on the sk receive_queue.
Namely the BPF programs.

However, we've started to introduce a third configuration where the
BPF program on receive should process the data, but then the normal
case is to push the data into the receive queue of AppB.

       AppB
       recv()                (userspace)
     -----------------------
       tcp_bpf_recvmsg()     (kernel)
         |             |
         |             |
         |             |
       ingress_msgQ    |
         |             |
       RX_BPF          |
         |             |
         v             v
       sk->receive_queue


This is different from the App{A,B} redirect because traffic is
first received on the sk->receive_queue.

Now for the issue. The tcp_bpf_recvmsg() handler first checks the
ingress_msg queue for any data handled by the BPF rx program and
returned with PASS code so that it was enqueued on the ingress msg
queue. Then if no data exists on that queue it checks the socket
receive queue. Unfortunately, this is the same receive_queue the
BPF program is reading data off of. So we get a race. It's possible
for the recvmsg() hook to pull data off the receive_queue before
the BPF hook has a chance to read it. It typically happens when
an application is banging on recv() and getting EAGAINs until
it manages to race with the RX BPF program.

To fix this we note that before this patch, at attach time when
the socket is loaded into the map, we check if it needs a TX
program or just the base set of proto bpf hooks. It then uses
the above general RX hook regardless of whether a BPF program is
attached at rx or not. This patch extends this check to
handle all cases enumerated above: TX, RX, TXRX, and none. And
to fix the above race, when an RX program is attached we use a new
hook that is nearly identical to the old one except that we
do not let the recv() call skip the RX BPF program. Now only
the BPF program pulls data from sk->receive_queue and recv()
only pulls data from the ingress msgQ post BPF program handling.
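
As a rough illustration of the selection logic described above, here is a
userspace sketch (not kernel code). The enum values mirror the ones in
the diff below, but pick_config(), uses_parser_recvmsg() and the two
boolean parameters are hypothetical names made up for this example; in
the patch the inputs come from psock->progs.

```c
#include <stdbool.h>

/* Mirrors TCP_BPF_BASE/TX/RX/TXRX from the patch; the CFG_ names are
 * local to this sketch.
 */
enum tcp_bpf_cfg { CFG_BASE, CFG_TX, CFG_RX, CFG_TXRX, CFG_NUM };

/* Hypothetical helper: choose a proto config from which programs are
 * attached. has_tx_prog models progs.msg_parser, has_rx_prog models
 * progs.stream_verdict || progs.skb_verdict.
 */
static enum tcp_bpf_cfg pick_config(bool has_tx_prog, bool has_rx_prog)
{
	enum tcp_bpf_cfg config = has_tx_prog ? CFG_TX : CFG_BASE;

	if (has_rx_prog)
		config = (config == CFG_TX) ? CFG_TXRX : CFG_RX;
	return config;
}

/* Hypothetical helper: true when recv() must only read from the ingress
 * msg queue, i.e. when the config installs tcp_bpf_recvmsg_parser.
 */
static bool uses_parser_recvmsg(enum tcp_bpf_cfg config)
{
	return config == CFG_RX || config == CFG_TXRX;
}
```

Only the two configs with an RX program switch recvmsg, so sockets
without an RX program keep the existing behavior.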

With this resolved, our AppB from above has been up and running
for many hours without detecting any errors. We verify this by
correlating counters in RX BPF events and the AppB to ensure
data is never skipping the BPF program. Selftests were not
able to detect this because we only run them for a short
period of time on well ordered send/recvs, so we don't get any
of the noise we see in real application environments.

Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/ipv4/tcp_bpf.c | 47 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index 35dcfb04f53d..0cc420c0e259 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -185,6 +185,41 @@ static int tcp_msg_wait_data(struct sock *sk, struct sk_psock *psock,
 	return ret;
 }
 
+static int tcp_bpf_recvmsg_parser(struct sock *sk,
+				  struct msghdr *msg,
+				  size_t len,
+				  int nonblock,
+				  int flags,
+				  int *addr_len)
+{
+	struct sk_psock *psock;
+	int copied;
+
+	if (unlikely(flags & MSG_ERRQUEUE))
+		return inet_recv_error(sk, msg, len, addr_len);
+
+	psock = sk_psock_get(sk);
+	if (unlikely(!psock))
+		return tcp_recvmsg(sk, msg, len, nonblock, flags, addr_len);
+
+	lock_sock(sk);
+msg_bytes_ready:
+	copied = sk_msg_recvmsg(sk, psock, msg, len, flags);
+	if (!copied) {
+		long timeo;
+		int data;
+
+		timeo = sock_rcvtimeo(sk, nonblock);
+		data = tcp_msg_wait_data(sk, psock, timeo);
+		if (data && !sk_psock_queue_empty(psock))
+			goto msg_bytes_ready;
+		copied = -EAGAIN;
+	}
+	release_sock(sk);
+	sk_psock_put(sk, psock);
+	return copied;
+}
+
 static int tcp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		    int nonblock, int flags, int *addr_len)
 {
@@ -465,6 +500,8 @@ enum {
 enum {
 	TCP_BPF_BASE,
 	TCP_BPF_TX,
+	TCP_BPF_RX,
+	TCP_BPF_TXRX,
 	TCP_BPF_NUM_CFGS,
 };
 
@@ -483,6 +520,12 @@ static void tcp_bpf_rebuild_protos(struct proto prot[TCP_BPF_NUM_CFGS],
 	prot[TCP_BPF_TX]			= prot[TCP_BPF_BASE];
 	prot[TCP_BPF_TX].sendmsg		= tcp_bpf_sendmsg;
 	prot[TCP_BPF_TX].sendpage		= tcp_bpf_sendpage;
+
+	prot[TCP_BPF_RX]			= prot[TCP_BPF_BASE];
+	prot[TCP_BPF_RX].recvmsg		= tcp_bpf_recvmsg_parser;
+
+	prot[TCP_BPF_TXRX]			= prot[TCP_BPF_TX];
+	prot[TCP_BPF_TXRX].recvmsg		= tcp_bpf_recvmsg_parser;
 }
 
 static void tcp_bpf_check_v6_needs_rebuild(struct proto *ops)
@@ -520,6 +563,10 @@ int tcp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
 	int family = sk->sk_family == AF_INET6 ? TCP_BPF_IPV6 : TCP_BPF_IPV4;
 	int config = psock->progs.msg_parser   ? TCP_BPF_TX   : TCP_BPF_BASE;
 
+	if (psock->progs.stream_verdict || psock->progs.skb_verdict) {
+		config = (config == TCP_BPF_TX) ? TCP_BPF_TXRX : TCP_BPF_RX;
+	}
+
 	if (restore) {
 		if (inet_csk_has_ulp(sk)) {
 			/* TLS does not have an unhash proto in SW cases,
-- 
2.33.0



* [PATCH bpf 3/4] bpf: sockmap, strparser, and tls are reusing qdisc_skb_cb and colliding
  2021-10-11 19:16 [PATCH bpf 0/4] bpf, sockmap: fixes stress testing and regression John Fastabend
  2021-10-11 19:16 ` [PATCH bpf 1/4] bpf, sockmap: Remove unhash handler for BPF sockmap usage John Fastabend
  2021-10-11 19:16 ` [PATCH bpf 2/4] bpf, sockmap: Fix race in ingress receive verdict with redirect to self John Fastabend
@ 2021-10-11 19:16 ` John Fastabend
  2021-10-19 15:39   ` Jakub Sitnicki
  2021-10-11 19:16 ` [PATCH bpf 4/4] bpf, sockmap: sk_skb data_end access incorrect when src_reg = dst_reg John Fastabend
  3 siblings, 1 reply; 14+ messages in thread
From: John Fastabend @ 2021-10-11 19:16 UTC (permalink / raw)
  To: bpf, netdev; +Cc: john.fastabend, daniel, joamaki, xiyou.wangcong

Strparser is reusing the qdisc_skb_cb struct to stash the skb message
handling progress, e.g. offset and length of the skb. First this is
poorly named and inherits a struct from qdisc that doesn't reflect the
actual usage of cb[] at this layer.

But, more importantly strparser is using the following to access its
metadata.

(struct _strp_msg *)((void *)skb->cb + offsetof(struct qdisc_skb_cb, data))

Where _strp_msg is defined as,

 struct _strp_msg {
        struct strp_msg            strp;                 /*     0     8 */
        int                        accum_len;            /*     8     4 */

        /* size: 12, cachelines: 1, members: 2 */
        /* last cacheline: 12 bytes */
 };

So we use 12 bytes of ->data[] in the struct. However, in BPF code running
parser and verdict programs the user has read access to the data[]
array as well. It's not too problematic, but we should not be
exposing internal state to BPF programs. If it's really needed then we can
use the probe_read() APIs which allow reading kernel memory. And I don't
believe moving this around within cb[] poses any API breakage, because
programs can't depend on cb[] across layers.

In order to fix another issue with a ctx rewrite we need to stash a temp
variable somewhere. To make this work cleanly this patch builds a cb
struct for sk_skb types called sk_skb_cb struct. Then we can use this
consistently in the strparser, sockmap space. Additionally we can
start allowing ->cb[] write access after this.
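
To make the layout concrete, here is a userspace sketch (not kernel code)
of the cb split described above. The struct definitions follow the diff
below; SKB_CB_SIZE, cb_strp() and cb_fill_bpf_data() are hypothetical
names for this example, with 48 standing in for sizeof(skb->cb).

```c
#include <stddef.h>
#include <string.h>

#define SKB_CB_SIZE		48	/* models sizeof(skb->cb) */
#define SK_SKB_CB_PRIV_LEN	20

struct strp_msg {
	int full_len;
	int offset;
};

struct _strp_msg {
	struct strp_msg strp;
	int accum_len;
};

struct sk_skb_cb {
	unsigned char data[SK_SKB_CB_PRIV_LEN];	/* BPF-visible cb[0..4] */
	struct _strp_msg strp;			/* strparser private state */
};

_Static_assert(sizeof(struct sk_skb_cb) <= SKB_CB_SIZE,
	       "sk_skb_cb must fit in skb->cb");

/* Hypothetical accessor: strparser state lives after the BPF data area. */
static struct _strp_msg *cb_strp(unsigned char *cb)
{
	return (struct _strp_msg *)(cb + offsetof(struct sk_skb_cb, strp));
}

/* Hypothetical helper: model a BPF program writing all of cb[0..4]. */
static void cb_fill_bpf_data(unsigned char *cb, unsigned char byte)
{
	memset(cb + offsetof(struct sk_skb_cb, data), byte,
	       SK_SKB_CB_PRIV_LEN);
}
```

With this split, a write to any of the 20 BPF-visible bytes cannot touch
the strparser state, which is what makes cb[] write access safe to allow.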

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 include/net/strparser.h   | 16 +++++++++++++++-
 net/core/filter.c         | 22 ++++++++++++++++++++++
 net/strparser/strparser.c | 10 +---------
 3 files changed, 38 insertions(+), 10 deletions(-)

diff --git a/include/net/strparser.h b/include/net/strparser.h
index 1d20b98493a1..bec1439bd3be 100644
--- a/include/net/strparser.h
+++ b/include/net/strparser.h
@@ -54,10 +54,24 @@ struct strp_msg {
 	int offset;
 };
 
+struct _strp_msg {
+	/* Internal cb structure. struct strp_msg must be first for passing
+	 * to upper layer.
+	 */
+	struct strp_msg strp;
+	int accum_len;
+};
+
+struct sk_skb_cb {
+#define SK_SKB_CB_PRIV_LEN 20
+	unsigned char data[SK_SKB_CB_PRIV_LEN];
+	struct _strp_msg strp;
+};
+
 static inline struct strp_msg *strp_msg(struct sk_buff *skb)
 {
 	return (struct strp_msg *)((void *)skb->cb +
-		offsetof(struct qdisc_skb_cb, data));
+		offsetof(struct sk_skb_cb, strp));
 }
 
 /* Structure for an attached lower socket */
diff --git a/net/core/filter.c b/net/core/filter.c
index 2e32cee2c469..23a9bf92b5bb 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -9761,11 +9761,33 @@ static u32 sk_skb_convert_ctx_access(enum bpf_access_type type,
 				     struct bpf_prog *prog, u32 *target_size)
 {
 	struct bpf_insn *insn = insn_buf;
+	int off;
 
 	switch (si->off) {
 	case offsetof(struct __sk_buff, data_end):
 		insn = bpf_convert_data_end_access(si, insn);
 		break;
+	case offsetof(struct __sk_buff, cb[0]) ...
+	     offsetofend(struct __sk_buff, cb[4]) - 1:
+		BUILD_BUG_ON(sizeof_field(struct sk_skb_cb, data) < 20);
+		BUILD_BUG_ON((offsetof(struct sk_buff, cb) +
+			      offsetof(struct sk_skb_cb, data)) %
+			     sizeof(__u64));
+
+		prog->cb_access = 1;
+		off  = si->off;
+		off -= offsetof(struct __sk_buff, cb[0]);
+		off += offsetof(struct sk_buff, cb);
+		off += offsetof(struct sk_skb_cb, data);
+		if (type == BPF_WRITE)
+			*insn++ = BPF_STX_MEM(BPF_SIZE(si->code), si->dst_reg,
+					      si->src_reg, off);
+		else
+			*insn++ = BPF_LDX_MEM(BPF_SIZE(si->code), si->dst_reg,
+					      si->src_reg, off);
+		break;
+
+
 	default:
 		return bpf_convert_ctx_access(type, si, insn_buf, prog,
 					      target_size);
diff --git a/net/strparser/strparser.c b/net/strparser/strparser.c
index 9c0343568d2a..1a72c67afed5 100644
--- a/net/strparser/strparser.c
+++ b/net/strparser/strparser.c
@@ -27,18 +27,10 @@
 
 static struct workqueue_struct *strp_wq;
 
-struct _strp_msg {
-	/* Internal cb structure. struct strp_msg must be first for passing
-	 * to upper layer.
-	 */
-	struct strp_msg strp;
-	int accum_len;
-};
-
 static inline struct _strp_msg *_strp_msg(struct sk_buff *skb)
 {
 	return (struct _strp_msg *)((void *)skb->cb +
-		offsetof(struct qdisc_skb_cb, data));
+		offsetof(struct sk_skb_cb, strp));
 }
 
 /* Lower lock held */
-- 
2.33.0



* [PATCH bpf 4/4] bpf, sockmap: sk_skb data_end access incorrect when src_reg = dst_reg
  2021-10-11 19:16 [PATCH bpf 0/4] bpf, sockmap: fixes stress testing and regression John Fastabend
                   ` (2 preceding siblings ...)
  2021-10-11 19:16 ` [PATCH bpf 3/4] bpf: sockmap, strparser, and tls are reusing qdisc_skb_cb and colliding John Fastabend
@ 2021-10-11 19:16 ` John Fastabend
  2021-10-23 13:05   ` Jakub Sitnicki
  3 siblings, 1 reply; 14+ messages in thread
From: John Fastabend @ 2021-10-11 19:16 UTC (permalink / raw)
  To: bpf, netdev; +Cc: john.fastabend, daniel, joamaki, xiyou.wangcong

From: Jussi Maki <joamaki@gmail.com>

The current conversion of skb->data_end reads like this,

  ; data_end = (void*)(long)skb->data_end;
   559: (79) r1 = *(u64 *)(r2 +200)   ; r1  = skb->data
   560: (61) r11 = *(u32 *)(r2 +112)  ; r11 = skb->len
   561: (0f) r1 += r11
   562: (61) r11 = *(u32 *)(r2 +116)
   563: (1f) r1 -= r11

But similar to the case

 ("bpf: sock_ops sk access may stomp registers when dst_reg = src_reg"),

the code will read an incorrect skb->len when src == dst. In this case we
end up generating this xlated code.

  ; data_end = (void*)(long)skb->data_end;
   559: (79) r1 = *(u64 *)(r1 +200)   ; r1  = skb->data
   560: (61) r11 = *(u32 *)(r1 +112)  ; r11 = (skb->data)->len
   561: (0f) r1 += r11
   562: (61) r11 = *(u32 *)(r1 +116)
   563: (1f) r1 -= r11

where line 560 is reading 4B of (skb->data + 112) instead of the
intended skb->len. Here the skb pointer in r1 gets set to skb->data and
the later deref for skb->len ends up following skb->data instead of skb.

This fixes the issue similarly to the patch mentioned above by creating
an additional temporary variable and using it to store the register when
dst_reg = src_reg. We name the variable temp_reg and place it in the
cb context for sk_skb. Then we restore from the temp to ensure nothing
is lost.
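
The save/compute/restore sequence can be emulated in userspace roughly
as follows. This is only a sketch of the idea, not the kernel code:
struct fake_skb, skb_at() and emulate_data_end() are hypothetical names,
and registers are modeled as a plain array with index 11 standing in for
BPF_REG_AX.

```c
#include <stdint.h>

enum { R_AX = 11, NREGS = 12 };	/* r0..r10 plus the hidden AX register */

struct fake_skb {
	uint64_t data;		/* stands in for skb->data (an address) */
	uint32_t len;
	uint32_t data_len;
	uint64_t temp_reg;	/* stands in for sk_skb_cb.temp_reg */
};

static struct fake_skb *skb_at(uint64_t addr)
{
	return (struct fake_skb *)(uintptr_t)addr;
}

/* Emulate "dst = skb->data_end" where regs[src] holds the skb pointer.
 * Every load goes through regs[src], as the emitted instructions do.
 */
static void emulate_data_end(uint64_t regs[NREGS], int dst, int src)
{
	int reg = dst;

	if (src == dst) {
		/* Pick a scratch register other than src/dst and save it.
		 * One decrement is enough here because src == dst.
		 */
		reg = 9;
		if (reg == src)
			reg--;
		skb_at(regs[src])->temp_reg = regs[reg];
	}

	regs[reg]  = skb_at(regs[src])->data;
	regs[R_AX] = skb_at(regs[src])->len;
	regs[reg] += regs[R_AX];
	regs[R_AX] = skb_at(regs[src])->data_len;
	regs[reg] -= regs[R_AX];

	if (src == dst) {
		/* Move the result into dst, then restore the scratch reg. */
		regs[R_AX] = regs[src];
		regs[dst]  = regs[reg];
		regs[reg]  = skb_at(regs[R_AX])->temp_reg;
	}
}
```

The key point is that the skb pointer in regs[src] stays intact until
every field has been loaded, and the scratch register ends up with its
original value.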

Fixes: 16137b09a66f2 ("bpf: Compute data_end dynamically with JIT code")
Signed-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 include/net/strparser.h |  4 ++++
 net/core/filter.c       | 36 ++++++++++++++++++++++++++++++------
 2 files changed, 34 insertions(+), 6 deletions(-)

diff --git a/include/net/strparser.h b/include/net/strparser.h
index bec1439bd3be..732b7097d78e 100644
--- a/include/net/strparser.h
+++ b/include/net/strparser.h
@@ -66,6 +66,10 @@ struct sk_skb_cb {
 #define SK_SKB_CB_PRIV_LEN 20
 	unsigned char data[SK_SKB_CB_PRIV_LEN];
 	struct _strp_msg strp;
+	/* temp_reg is a temporary register used for bpf_convert_data_end_access
+	 * when dst_reg == src_reg.
+	 */
+	u64 temp_reg;
 };
 
 static inline struct strp_msg *strp_msg(struct sk_buff *skb)
diff --git a/net/core/filter.c b/net/core/filter.c
index 23a9bf92b5bb..f4a63af45f00 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -9735,22 +9735,46 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
 static struct bpf_insn *bpf_convert_data_end_access(const struct bpf_insn *si,
 						    struct bpf_insn *insn)
 {
-	/* si->dst_reg = skb->data */
+	int reg;
+	int temp_reg_off = offsetof(struct sk_buff, cb) +
+			   offsetof(struct sk_skb_cb, temp_reg);
+
+	if (si->src_reg == si->dst_reg) {
+		/* We need an extra register, choose and save a register. */
+		reg = BPF_REG_9;
+		if (si->src_reg == reg || si->dst_reg == reg)
+			reg--;
+		if (si->src_reg == reg || si->dst_reg == reg)
+			reg--;
+		*insn++ = BPF_STX_MEM(BPF_DW, si->src_reg, reg, temp_reg_off);
+	} else {
+		reg = si->dst_reg;
+	}
+
+	/* reg = skb->data */
 	*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_buff, data),
-			      si->dst_reg, si->src_reg,
+			      reg, si->src_reg,
 			      offsetof(struct sk_buff, data));
 	/* AX = skb->len */
 	*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_buff, len),
 			      BPF_REG_AX, si->src_reg,
 			      offsetof(struct sk_buff, len));
-	/* si->dst_reg = skb->data + skb->len */
-	*insn++ = BPF_ALU64_REG(BPF_ADD, si->dst_reg, BPF_REG_AX);
+	/* reg = skb->data + skb->len */
+	*insn++ = BPF_ALU64_REG(BPF_ADD, reg, BPF_REG_AX);
 	/* AX = skb->data_len */
 	*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_buff, data_len),
 			      BPF_REG_AX, si->src_reg,
 			      offsetof(struct sk_buff, data_len));
-	/* si->dst_reg = skb->data + skb->len - skb->data_len */
-	*insn++ = BPF_ALU64_REG(BPF_SUB, si->dst_reg, BPF_REG_AX);
+
+	/* reg = skb->data + skb->len - skb->data_len */
+	*insn++ = BPF_ALU64_REG(BPF_SUB, reg, BPF_REG_AX);
+
+	if (si->src_reg == si->dst_reg) {
+		/* Restore the saved register */
+		*insn++ = BPF_MOV64_REG(BPF_REG_AX, si->src_reg);
+		*insn++ = BPF_MOV64_REG(si->dst_reg, reg);
+		*insn++ = BPF_LDX_MEM(BPF_DW, reg, BPF_REG_AX, temp_reg_off);
+	}
 
 	return insn;
 }
-- 
2.33.0



* Re: [PATCH bpf 1/4] bpf, sockmap: Remove unhash handler for BPF sockmap usage
  2021-10-11 19:16 ` [PATCH bpf 1/4] bpf, sockmap: Remove unhash handler for BPF sockmap usage John Fastabend
@ 2021-10-19  7:17   ` Jakub Sitnicki
  2021-10-20  5:28     ` John Fastabend
  0 siblings, 1 reply; 14+ messages in thread
From: Jakub Sitnicki @ 2021-10-19  7:17 UTC (permalink / raw)
  To: John Fastabend; +Cc: bpf, netdev, daniel, joamaki, xiyou.wangcong

On Mon, Oct 11, 2021 at 09:16 PM CEST, John Fastabend wrote:
> We do not need to handle unhash from BPF side we can simply wait for the
> close to happen. The original concern was a socket could transition from
> ESTABLISHED state to a new state while the BPF hook was still attached.
> But, we convinced ourself this is no longer possible and we also
> improved BPF sockmap to handle listen sockets so this is no longer a
> problem.
>
> More importantly though there are cases where unhash is called when data is
> in the receive queue. The BPF unhash logic will flush this data which is
> wrong. To be correct it should keep the data in the receive queue and allow
> a receiving application to continue reading the data. This may happen when
> tcp_abort is received for example. Instead of complicating the logic in
> unhash simply moving all this to tcp_close hook solves this.
>
> Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---

Doesn't this open the possibility of having a TCP_CLOSE socket in
sockmap if I disconnect it, that is call connect(AF_UNSPEC), instead of
close it?


* Re: [PATCH bpf 2/4] bpf, sockmap: Fix race in ingress receive verdict with redirect to self
  2021-10-11 19:16 ` [PATCH bpf 2/4] bpf, sockmap: Fix race in ingress receive verdict with redirect to self John Fastabend
@ 2021-10-19  9:16   ` Jakub Sitnicki
  0 siblings, 0 replies; 14+ messages in thread
From: Jakub Sitnicki @ 2021-10-19  9:16 UTC (permalink / raw)
  To: John Fastabend; +Cc: bpf, netdev, daniel, joamaki, xiyou.wangcong


On Mon, Oct 11, 2021 at 09:16 PM CEST, John Fastabend wrote:
> A socket in a sockmap may have different combinations of programs
> attached depending on configuration. There can be no programs in which
> case the socket acts as a sink only. There can be a TX program in this
> case a BPF program is attached to sending side, but no RX program is
> attached. There can be an RX program only where sends have no BPF
> program attached, but receives are hooked with BPF. And finally,
> both TX and RX programs may be attached. Giving us the permutations,
>
>  None, Tx, Rx, and TxRx
>
> To date most of our use cases have been TX case being used as a fast
> datapath to directly copy between local application and a userspace
> proxy. Or Rx cases and TxRX applications that are operating an in
> kernel based proxy. The traffic in the first case where we hook
> applications into a userspace application looks like this,
>
>   AppA  redirect   AppB
>    Tx <-----------> Rx
>    |                |
>    +                +
>    TCP <--> lo <--> TCP
>
> In this case all traffic from AppA (after 3whs) is copied into the
> AppB ingress queue and no traffic is ever on the TCP recieive_queue.
>
> In the second case the application never receives, except in some
> rare error cases, traffic on the actual user space socket. Instead
> the send happens in the kernel.
>
>            AppProxy       socket pool
>        sk0 ------------->{sk1,sk2, skn}
>         ^                      |
>         |                      |
>         |                      v
>        ingress              lb egress
>        TCP                  TCP
>
> Here because traffic is never read off the socket with userspace
> recv() APIs there is only ever one reader on the sk receive_queue.
> Namely the BPF programs.
>
> However, we've started to introduce a third configuration where the
> BPF program on receive should process the data, but then the normal
> case is to push the data into the receive queue of AppB.
>
>        AppB
>        recv()                (userspace)
>      -----------------------
>        tcp_bpf_recvmsg()     (kernel)
>          |             |
>          |             |
>          |             |
>        ingress_msgQ    |
>          |             |
>        RX_BPF          |
>          |             |
>          v             v
>        sk->receive_queue
>
>
> This is different from the App{A,B} redirect because traffic is
> first received on the sk->receive_queue.
>
> Now for the issue. The tcp_bpf_recvmsg() handler first checks the
> ingress_msg queue for any data handled by the BPF rx program and
> returned with PASS code so that it was enqueued on the ingress msg
> queue. Then if no data exists on that queue it checks the socket
> receive queue. Unfortunately, this is the same receive_queue the
> BPF program is reading data off of. So we get a race. Its possible
> for the recvmsg() hook to pull data off the receive_queue before
> the BPF hook has a chance to read it. It typically happens when
> an application is banging on recv() and getting EAGAINs. Until
> they manage to race with the RX BPF program.
>
> To fix this we note that before this patch at attach time when
> the socket is loaded into the map we check if it needs a TX
> program or just the base set of proto bpf hooks. Then it uses
> the above general RX hook regardless of if we have a BPF program
> attached at rx or not. This patch now extends this check to
> handle all cases enumerated above, TX, RX, TXRX, and none. And
> to fix above race when an RX program is attached we use a new
> hook that is nearly identical to the old one except now we
> do not let the recv() call skip the RX BPF program. Now only
> the BPF program pulls data from sk->receive_queue and recv()
> only pulls data from the ingress msgQ post BPF program handling.
>
> With this resolved our AppB from above has been up and running
> for many hours without detecting any errors. We do this by
> correlating counters in RX BPF events and the AppB to ensure
> data is never skipping the BPF program. Selftests, was not
> able to detect this because we only run them for a short
> period of time on well ordered send/recvs so we don't get any
> of the noise we see in real application environments.
>
> Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---

Acked-by: Jakub Sitnicki <jakub@cloudflare.com>


* Re: [PATCH bpf 3/4] bpf: sockmap, strparser, and tls are reusing qdisc_skb_cb and colliding
  2021-10-11 19:16 ` [PATCH bpf 3/4] bpf: sockmap, strparser, and tls are reusing qdisc_skb_cb and colliding John Fastabend
@ 2021-10-19 15:39   ` Jakub Sitnicki
  0 siblings, 0 replies; 14+ messages in thread
From: Jakub Sitnicki @ 2021-10-19 15:39 UTC (permalink / raw)
  To: John Fastabend; +Cc: bpf, netdev, daniel, joamaki, xiyou.wangcong

On Mon, Oct 11, 2021 at 09:16 PM CEST, John Fastabend wrote:
> Strparser is reusing the qdisc_skb_cb struct to stash the skb message
> handling progress, e.g. offset and length of the skb. First this is
> poorly named and inherits a struct from qdisc that doesn't reflect the
> actual usage of cb[] at this layer.
>
> But, more importantly strparser is using the following to access its
> metadata.
>
> (struct _strp_msg *)((void *)skb->cb + offsetof(struct qdisc_skb_cb, data))
>
> Where _strp_msg is defined as,
>
>  struct _strp_msg {
>         struct strp_msg            strp;                 /*     0     8 */
>         int                        accum_len;            /*     8     4 */
>
>         /* size: 12, cachelines: 1, members: 2 */
>         /* last cacheline: 12 bytes */
>  };
>
> So we use 12 bytes of ->data[] in struct. However in BPF code running
> parser and verdict the user has read capabilities into the data[]
> array as well. Its not too problematic, but we should not be
> exposing internal state to BPF program. If its really needed then we can
> use the probe_read() APIs which allow reading kernel memory. And I don't
> believe cb[] layer poses any API breakage by moving this around because
> programs can't depend on cb[] across layers.
>
> In order to fix another issue with a ctx rewrite we need to stash a temp
> variable somewhere. To make this work cleanly this patch builds a cb
> struct for sk_skb types called sk_skb_cb struct. Then we can use this
> consistently in the strparser, sockmap space. Additionally we can
> start allowing ->cb[] write access after this.
>
> Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface"
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---
>  include/net/strparser.h   | 16 +++++++++++++++-
>  net/core/filter.c         | 22 ++++++++++++++++++++++
>  net/strparser/strparser.c | 10 +---------
>  3 files changed, 38 insertions(+), 10 deletions(-)
>
> diff --git a/include/net/strparser.h b/include/net/strparser.h
> index 1d20b98493a1..bec1439bd3be 100644
> --- a/include/net/strparser.h
> +++ b/include/net/strparser.h
> @@ -54,10 +54,24 @@ struct strp_msg {
>  	int offset;
>  };
>
> +struct _strp_msg {
> +	/* Internal cb structure. struct strp_msg must be first for passing
> +	 * to upper layer.
> +	 */
> +	struct strp_msg strp;
> +	int accum_len;
> +};
> +
> +struct sk_skb_cb {
> +#define SK_SKB_CB_PRIV_LEN 20

Nit: Would consider reusing BPF_SKB_CB_LEN from linux/filter.h.
net/bpf/test_run.c should probably use it too, instead of
QDISC_CB_PRIV_LEN.

> +	unsigned char data[SK_SKB_CB_PRIV_LEN];
> +	struct _strp_msg strp;
> +};
> +
>  static inline struct strp_msg *strp_msg(struct sk_buff *skb)
>  {
>  	return (struct strp_msg *)((void *)skb->cb +
> -		offsetof(struct qdisc_skb_cb, data));
> +		offsetof(struct sk_skb_cb, strp));
>  }
>
>  /* Structure for an attached lower socket */

[...]

Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>


* Re: [PATCH bpf 1/4] bpf, sockmap: Remove unhash handler for BPF sockmap usage
  2021-10-19  7:17   ` Jakub Sitnicki
@ 2021-10-20  5:28     ` John Fastabend
  2021-10-20 15:11       ` Jakub Sitnicki
  0 siblings, 1 reply; 14+ messages in thread
From: John Fastabend @ 2021-10-20  5:28 UTC (permalink / raw)
  To: Jakub Sitnicki, John Fastabend
  Cc: bpf, netdev, daniel, joamaki, xiyou.wangcong

Jakub Sitnicki wrote:
> On Mon, Oct 11, 2021 at 09:16 PM CEST, John Fastabend wrote:
> > We do not need to handle unhash from BPF side we can simply wait for the
> > close to happen. The original concern was a socket could transition from
> > ESTABLISHED state to a new state while the BPF hook was still attached.
> > But, we convinced ourself this is no longer possible and we also
> > improved BPF sockmap to handle listen sockets so this is no longer a
> > problem.
> >
> > More importantly though there are cases where unhash is called when data is
> > in the receive queue. The BPF unhash logic will flush this data which is
> > wrong. To be correct it should keep the data in the receive queue and allow
> > a receiving application to continue reading the data. This may happen when
> > tcp_abort is received for example. Instead of complicating the logic in
> > unhash simply moving all this to tcp_close hook solves this.
> >
> > Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
> > Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> > ---
> 
> Doesn't this open the possibility of having a TCP_CLOSE socket in
> sockmap if I disconnect it, that is call connect(AF_UNSPEC), instead of
> close it?

Correct, it means we may have a TCP_CLOSE socket in the map. I'm not
seeing any problem with this though. A send on the socket would
fail the sk_state checks in the send hooks (tcp.c:1245). Receiving
from the TCP stack would fail with the normal TCP stack checks.

Maybe we also want a check on redirect into ingress that the sock is
in ESTABLISHED state? I might push that in its own patch, even
though it seems related; I think we should have that check there
regardless of this patch.

Did you happen to see any issues on the sock_map side for the close case?
It looks good to me.

.John

* Re: [PATCH bpf 1/4] bpf, sockmap: Remove unhash handler for BPF sockmap usage
  2021-10-20  5:28     ` John Fastabend
@ 2021-10-20 15:11       ` Jakub Sitnicki
  2021-10-20 15:51         ` John Fastabend
  0 siblings, 1 reply; 14+ messages in thread
From: Jakub Sitnicki @ 2021-10-20 15:11 UTC (permalink / raw)
  To: John Fastabend; +Cc: bpf, netdev, daniel, joamaki, xiyou.wangcong

On Wed, Oct 20, 2021 at 07:28 AM CEST, John Fastabend wrote:
> Jakub Sitnicki wrote:
>> On Mon, Oct 11, 2021 at 09:16 PM CEST, John Fastabend wrote:
>> > We do not need to handle unhash from BPF side we can simply wait for the
>> > close to happen. The original concern was a socket could transition from
>> > ESTABLISHED state to a new state while the BPF hook was still attached.
>> > But, we convinced ourself this is no longer possible and we also
>> > improved BPF sockmap to handle listen sockets so this is no longer a
>> > problem.
>> >
>> > More importantly though there are cases where unhash is called when data is
>> > in the receive queue. The BPF unhash logic will flush this data which is
>> > wrong. To be correct it should keep the data in the receive queue and allow
>> > a receiving application to continue reading the data. This may happen when
>> > tcp_abort is received for example. Instead of complicating the logic in
>> > unhash simply moving all this to tcp_close hook solves this.
>> >
>> > Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
>> > Signed-off-by: John Fastabend <john.fastabend@gmail.com>
>> > ---
>>
>> Doesn't this open the possibility of having a TCP_CLOSE socket in
>> sockmap if I disconnect it, that is call connect(AF_UNSPEC), instead of
>> close it?
>
> Correct it means we may have TCP_CLOSE socket in the map. I'm not
> seeing any problem with this though. A send on the socket would
> fail the sk_state checks in the send hooks. (tcp.c:1245). Receiving
> from the TCP stack would fail with normal TCP stack checks.
>
> Maybe we want a check on redirect into ingress if the sock is in
> ESTABLISHED state as well? I might push that in its own patch
> though it seems related, but I think we should have that there
> regardless of this patch.
>
> Did you happen to see any issues on the sock_map side for close case?
> It looks good to me.

OK, I didn't understand if that was an intended change or not.

If we're considering allowing TCP sockets in TCP_CLOSE state in sockmap,
a few things come to mind:

1) We can't insert TCP_CLOSE sockets today. sock_map_sk_state_allowed()
   won't allow it. However, with this change we will be able to have a
   TCP_CLOSE socket in sockmap by disconnecting it. If so, perhaps
   inserting TCP sockets in TCP_CLOSE state should be allowed for
   consistency.

2) Checks in bpf_sk_lookup_assign() helper need adjusting. Only TCP
   sockets in TCP_LISTEN state make a valid choice (and UDP sockets in
   TCP_CLOSE state). Today we rely there on the fact that you can't
   insert a TCP_CLOSE socket.

3) Checks in sk_select_reuseport() helper need adjusting as well. It's a
   similar case to bpf_sk_lookup_assign() (with a slight
   difference that reuseport allows dispatching to connected UDP
   sockets).

4) Don't know exactly how checks in sockmap redirect helpers would need
   to be tweaked. I recall that it can't be just TCP_ESTABLISHED state
   that's allowed due to a short window of opportunity that opens up
   when we transition from TCP_SYN_SENT to TCP_ESTABLISHED.
   BPF_SOCK_OPS_STATE_CB callback happens just before the state is
   switched to TCP_ESTABLISHED.

   TCP_CLOSE socket sure doesn't make sense as a redirect target. Would
   be nice to get an error from the redirect helper. If I understand
   correctly, if the TCP stack drops the packet after BPF verdict has
   selected a socket, only the socket owner will know about it by reading
   the error queue.

   OTOH, redirecting to a TCP_CLOSE_WAIT socket doesn't make sense
   either, but we don't seem to filter it out today, so the helper is
   not airtight.

All in all, sounds like an API change when it comes to corner cases, in
addition to being a fix for the receive queue flush issue which you
explained in the patch description. If possible, would push it through
bpf-next.

* Re: [PATCH bpf 1/4] bpf, sockmap: Remove unhash handler for BPF sockmap usage
  2021-10-20 15:11       ` Jakub Sitnicki
@ 2021-10-20 15:51         ` John Fastabend
  2021-10-20 16:35           ` Jakub Sitnicki
  0 siblings, 1 reply; 14+ messages in thread
From: John Fastabend @ 2021-10-20 15:51 UTC (permalink / raw)
  To: Jakub Sitnicki, John Fastabend
  Cc: bpf, netdev, daniel, joamaki, xiyou.wangcong

Jakub Sitnicki wrote:
> On Wed, Oct 20, 2021 at 07:28 AM CEST, John Fastabend wrote:
> > Jakub Sitnicki wrote:
> >> On Mon, Oct 11, 2021 at 09:16 PM CEST, John Fastabend wrote:
> >> > We do not need to handle unhash from BPF side we can simply wait for the
> >> > close to happen. The original concern was a socket could transition from
> >> > ESTABLISHED state to a new state while the BPF hook was still attached.
> >> > But, we convinced ourself this is no longer possible and we also
> >> > improved BPF sockmap to handle listen sockets so this is no longer a
> >> > problem.
> >> >
> >> > More importantly though there are cases where unhash is called when data is
> >> > in the receive queue. The BPF unhash logic will flush this data which is
> >> > wrong. To be correct it should keep the data in the receive queue and allow
> >> > a receiving application to continue reading the data. This may happen when
> >> > tcp_abort is received for example. Instead of complicating the logic in
> >> > unhash simply moving all this to tcp_close hook solves this.
> >> >
> >> > Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
> >> > Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> >> > ---
> >>
> >> Doesn't this open the possibility of having a TCP_CLOSE socket in
> >> sockmap if I disconnect it, that is call connect(AF_UNSPEC), instead of
> >> close it?
> >
> > Correct it means we may have TCP_CLOSE socket in the map. I'm not
> > seeing any problem with this though. A send on the socket would
> > fail the sk_state checks in the send hooks. (tcp.c:1245). Receiving
> > from the TCP stack would fail with normal TCP stack checks.
> >
> > Maybe we want a check on redirect into ingress if the sock is in
> > ESTABLISHED state as well? I might push that in its own patch
> > though it seems related, but I think we should have that there
> > regardless of this patch.
> >
> > Did you happen to see any issues on the sock_map side for close case?
> > It looks good to me.
> 
> OK, I didn't understand if that was an intended change or not.
> 

wrt bpf-next:
The problem is this needs to be backported in some way that fixes the
case for stable kernels as well. We have applications that are throwing
errors when they hit this at the moment.

> If we're considering allowing TCP sockets in TCP_CLOSE state in sockmap,
> a few things come to mind:

I think what makes most sense is to do the minimal work to fix the
described issue for bpf tree without introducing new issues and
then do the consistency/better cases in bpf-next.

> 
> 1) We can't insert TCP_CLOSE sockets today. sock_map_sk_state_allowed()
>    won't allow it. However, with this change we will be able to have a
>    TCP_CLOSE socket in sockmap by disconnecting it. If so, perhaps
>    inserting TCP sockets in TCP_CLOSE state should be allowed for
>    consistency.

I agree, but would hold off on this for bpf-next. I missed points
2,3 though in this series.

> 
> 2) Checks in bpf_sk_lookup_assign() helper need adjusting. Only TCP
>    sockets in TCP_LISTEN state make a valid choice (and UDP sockets in
>    TCP_CLOSE state). Today we rely on the fact there that you can't
>    insert a TCP_CLOSE socket.

This should be a minimal change: just change the logic to allow only
TCP_LISTEN.

--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -10402,7 +10402,7 @@ BPF_CALL_3(bpf_sk_lookup_assign, struct bpf_sk_lookup_kern *, ctx,
                return -EINVAL;
        if (unlikely(sk && sk_is_refcounted(sk)))
                return -ESOCKTNOSUPPORT; /* reject non-RCU freed sockets */
-       if (unlikely(sk && sk->sk_state == TCP_ESTABLISHED))
+       if (unlikely(sk && sk->sk_state != TCP_LISTEN))
                return -ESOCKTNOSUPPORT; /* reject connected sockets */
 
        /* Check if socket is suitable for packet L3/L4 protocol */


> 
> 3) Checks in sk_select_reuseport() helper need adjusting as well. It's a
>    similar same case as with bpf_sk_lookup_assign() (with a slight
>    difference that reuseport allows dispatching to connected UDP
>    sockets).

Is it needed here? There is no obvious check now.  Is ESTABLISHED
state OK here now?

> 
> 4) Don't know exactly how checks in sockmap redirect helpers would need
>    to be tweaked. I recall that it can't be just TCP_ESTABLISHED state
>    that's allowed due to a short window of opportunity that opens up
>    when we transition from TCP_SYN_SENT to TCP_ESTABLISHED.
>    BPF_SOCK_OPS_STATE_CB callback happens just before the state is
>    switched to TCP_ESTABLISHED.
> 
>    TCP_CLOSE socket sure doesn't make sense as a redirect target. Would
>    be nice to get an error from the redirect helper. If I understand
>    correctly, if the TCP stack drops the packet after BPF verdict has
>    selected a socket, only the socket owner will know about by reading
>    the error queue.
> 
>    OTOH, redirecting to a TCP_CLOSE_WAIT socket doesn't make sense
>    either, but we don't seem to filter it out today, so the helper is
>    not airtight.

Right. At the moment for sending we call do_tcp_sendpages() and this
has the normal check ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT) so we
would return an error. The missing case is ingress. We currently
let these happen and would need a check there. I was thinking
of doing it in a separate patch, but could tack it on to this
series for completeness.
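
To make the egress check concrete, here is a rough userspace sketch of the
state mask test. The state numbers are assumed to match
include/net/tcp_states.h; the X_ prefixed names and x_send_state_ok() are
hypothetical, chosen to avoid clashing with <netinet/tcp.h>. It ignores
the fastopen exception and the wait-for-connect step that the real send
path has.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to mirror the numbering in include/net/tcp_states.h. */
enum x_tcp_state {
	X_TCP_ESTABLISHED = 1,
	X_TCP_SYN_SENT,
	X_TCP_SYN_RECV,
	X_TCP_FIN_WAIT1,
	X_TCP_FIN_WAIT2,
	X_TCP_TIME_WAIT,
	X_TCP_CLOSE,
	X_TCP_CLOSE_WAIT,
	X_TCP_LAST_ACK,
	X_TCP_LISTEN,
	X_TCP_CLOSING,
	X_TCP_MAX_STATES	/* hypothetical end marker for this sketch */
};

/* One bit per state, like the kernel's TCPF_* flags. */
#define X_TCPF(s) (1u << (s))

/* True when the state bit is inside the ESTABLISHED | CLOSE_WAIT mask. */
static bool x_send_state_ok(enum x_tcp_state state)
{
	assert(state > 0 && state < X_TCP_MAX_STATES);
	return !(X_TCPF(state) &
		 ~(X_TCPF(X_TCP_ESTABLISHED) | X_TCPF(X_TCP_CLOSE_WAIT)));
}
```

With this, a TCP_CLOSE or TCP_LISTEN socket fails the test and the send
side errors out, which is why only the ingress redirect path is missing a
check.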

> 
> All in all, sounds like an API change when it comes to corner cases, in
> addition to being a fix for the receive queue flush issue which you
> explained in the patch description. If possible, would push it through
> bpf-next.

I think if we address 2,3,4 then we can fix the described issue
without introducing new cases. And then 1 is great for consistency
but can go via bpf-next?

WDYT.

Thanks,
John

* Re: [PATCH bpf 1/4] bpf, sockmap: Remove unhash handler for BPF sockmap usage
  2021-10-20 15:51         ` John Fastabend
@ 2021-10-20 16:35           ` Jakub Sitnicki
  2021-10-21 19:24             ` John Fastabend
  0 siblings, 1 reply; 14+ messages in thread
From: Jakub Sitnicki @ 2021-10-20 16:35 UTC (permalink / raw)
  To: John Fastabend
  Cc: bpf, netdev, daniel, joamaki, xiyou.wangcong, Lorenz Bauer,
	Martin KaFai Lau

On Wed, Oct 20, 2021 at 05:51 PM CEST, John Fastabend wrote:
> Jakub Sitnicki wrote:
>> On Wed, Oct 20, 2021 at 07:28 AM CEST, John Fastabend wrote:
>> > Jakub Sitnicki wrote:
>> >> On Mon, Oct 11, 2021 at 09:16 PM CEST, John Fastabend wrote:
>> >> > We do not need to handle unhash from BPF side we can simply wait for the
>> >> > close to happen. The original concern was a socket could transition from
>> >> > ESTABLISHED state to a new state while the BPF hook was still attached.
>> >> > But, we convinced ourself this is no longer possible and we also
>> >> > improved BPF sockmap to handle listen sockets so this is no longer a
>> >> > problem.
>> >> >
>> >> > More importantly though there are cases where unhash is called when data is
>> >> > in the receive queue. The BPF unhash logic will flush this data which is
>> >> > wrong. To be correct it should keep the data in the receive queue and allow
>> >> > a receiving application to continue reading the data. This may happen when
>> >> > tcp_abort is received for example. Instead of complicating the logic in
>> >> > unhash simply moving all this to tcp_close hook solves this.
>> >> >
>> >> > Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
>> >> > Signed-off-by: John Fastabend <john.fastabend@gmail.com>
>> >> > ---
>> >>
>> >> Doesn't this open the possibility of having a TCP_CLOSE socket in
>> >> sockmap if I disconnect it, that is call connect(AF_UNSPEC), instead of
>> >> close it?
>> >
>> > Correct it means we may have TCP_CLOSE socket in the map. I'm not
>> > seeing any problem with this though. A send on the socket would
>> > fail the sk_state checks in the send hooks. (tcp.c:1245). Receiving
>> > from the TCP stack would fail with normal TCP stack checks.
>> >
>> > Maybe we want a check on redirect into ingress if the sock is in
>> > ESTABLISHED state as well? I might push that in its own patch
>> > though it seems related, but I think we should have that there
>> > regardless of this patch.
>> >
>> > Did you happen to see any issues on the sock_map side for close case?
>> > It looks good to me.
>>
>> OK, I didn't understand if that was an intended change or not.
>>
>
> wrt bpf-next:
> The problem is this needs to be backported in some way that fixes the
> case for stable kernels as well. We have applications that are throwing
> errors when they hit this at the moment.

Understood.

>> If we're considering allowing TCP sockets in TCP_CLOSE state in sockmap,
>> a few things come to mind:
>
> I think what makes most sense is to do the minimal work to fix the
> described issue for bpf tree without introducing new issues and
> then do the consistency/better cases in bpf-next.
>
>>
>> 1) We can't insert TCP_CLOSE sockets today. sock_map_sk_state_allowed()
>>    won't allow it. However, with this change we will be able to have a
>>    TCP_CLOSE socket in sockmap by disconnecting it. If so, perhaps
>>    inserting TCP sockets in TCP_CLOSE state should be allowed for
>>    consistency.
>
> I agree, but would hold off on this for bpf-next. I missed points
> 2,3 though in this series.

OK, that makes sense.

>>
>> 2) Checks in bpf_sk_lookup_assign() helper need adjusting. Only TCP
>>    sockets in TCP_LISTEN state make a valid choice (and UDP sockets in
>>    TCP_CLOSE state). Today we rely on the fact there that you can't
>>    insert a TCP_CLOSE socket.
>
> This should be minimal change, just change the logic to allow only
> TCP_LISTEN.
>
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -10402,7 +10402,7 @@ BPF_CALL_3(bpf_sk_lookup_assign, struct bpf_sk_lookup_kern *, ctx,
>                 return -EINVAL;
>         if (unlikely(sk && sk_is_refcounted(sk)))
>                 return -ESOCKTNOSUPPORT; /* reject non-RCU freed sockets */
> -       if (unlikely(sk && sk->sk_state == TCP_ESTABLISHED))
> +       if (unlikely(sk && sk->sk_state != TCP_LISTEN))
>                 return -ESOCKTNOSUPPORT; /* reject connected sockets */
>
>         /* Check if socket is suitable for packet L3/L4 protocol */
>
>

Yeah, it shouldn't be hard. But we need to cover UDP as well. Something
along the lines of:

--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -10402,8 +10402,10 @@ BPF_CALL_3(bpf_sk_lookup_assign, struct bpf_sk_lookup_kern *, ctx,
                return -EINVAL;
        if (unlikely(sk && sk_is_refcounted(sk)))
                return -ESOCKTNOSUPPORT; /* reject non-RCU freed sockets */
-       if (unlikely(sk && sk->sk_state == TCP_ESTABLISHED))
-               return -ESOCKTNOSUPPORT; /* reject connected sockets */
+       if (unlikely(sk && sk_is_tcp(sk) && sk->sk_state != TCP_LISTEN))
+               return -ESOCKTNOSUPPORT; /* reject closed TCP sockets */
+       if (unlikely(sk && sk_is_udp(sk) && sk->sk_state != TCP_CLOSE))
+               return -ESOCKTNOSUPPORT; /* reject connected UDP sockets */

        /* Check if socket is suitable for packet L3/L4 protocol */
        if (sk && sk->sk_protocol != ctx->protocol)

We aren't testing that error case in the sk_lookup test suite today,
because it wasn't possible to insert a TCP_CLOSE socket. So once that
gets in, I can add coverage.
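
As a sanity check of the intended semantics, the two state tests from the
diff above can be modelled in plain userspace C. This is only a sketch:
struct x_sock, enum x_proto and x_lookup_assign_state_check() are
hypothetical names, and the state numbers are assumed to match
include/net/tcp_states.h.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum x_proto { X_PROTO_TCP, X_PROTO_UDP };

/* Assumed to match include/net/tcp_states.h. */
enum { X_TCP_ESTABLISHED = 1, X_TCP_CLOSE = 7, X_TCP_LISTEN = 10 };

struct x_sock {
	enum x_proto proto;
	int state;
};

/* Returns 0 if the socket state is acceptable, -ESOCKTNOSUPPORT if not. */
static int x_lookup_assign_state_check(const struct x_sock *sk)
{
	/* Every test in the diff is guarded by "sk &&", so NULL passes. */
	if (!sk)
		return 0;

	/* This sketch only models TCP and UDP sockets. */
	assert(sk->proto == X_PROTO_TCP || sk->proto == X_PROTO_UDP);

	/* TCP: only listeners are valid, so disconnected ones are out. */
	if (sk->proto == X_PROTO_TCP && sk->state != X_TCP_LISTEN)
		return -ESOCKTNOSUPPORT;

	/* UDP: unconnected sockets sit in TCP_CLOSE; reject connected. */
	if (sk->proto == X_PROTO_UDP && sk->state != X_TCP_CLOSE)
		return -ESOCKTNOSUPPORT;

	return 0;
}
```

The asymmetry is the reason a single "!= TCP_LISTEN" test is not enough:
it would also reject every unconnected UDP socket.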

>>
>> 3) Checks in sk_select_reuseport() helper need adjusting as well. It's a
>>    similar same case as with bpf_sk_lookup_assign() (with a slight
>>    difference that reuseport allows dispatching to connected UDP
>>    sockets).
>
> Is it needed here? There is no obvious check now.  Is ESTABLISHED
> state OK here now?

TCP ESTABLISHED sockets are not okay. They can't join the reuseport
group and will always hit the !reuse branch.

Re-reading the code, though, I think nothing needs to be done for the
sk_select_reuseport() helper. TCP sockets will be detached from
reuseport group on unhash. Hence TCP_CLOSE socket will also hit the
!reuse branch.

CC'ing Martin just in case he wants to double-check.

>
>>
>> 4) Don't know exactly how checks in sockmap redirect helpers would need
>>    to be tweaked. I recall that it can't be just TCP_ESTABLISHED state
>>    that's allowed due to a short window of opportunity that opens up
>>    when we transition from TCP_SYN_SENT to TCP_ESTABLISHED.
>>    BPF_SOCK_OPS_STATE_CB callback happens just before the state is
>>    switched to TCP_ESTABLISHED.
>>
>>    TCP_CLOSE socket sure doesn't make sense as a redirect target. Would
>>    be nice to get an error from the redirect helper. If I understand
>>    correctly, if the TCP stack drops the packet after BPF verdict has
>>    selected a socket, only the socket owner will know about by reading
>>    the error queue.
>>
>>    OTOH, redirecting to a TCP_CLOSE_WAIT socket doesn't make sense
>>    either, but we don't seem to filter it out today, so the helper is
>>    not airtight.
>
> Right. At the moment for sending we call do_tcp_sendpages() and this
> has the normal check ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT) so we
> would return an error. The missing case is ingress. We currently
> let these happen and would need a check there. I was thinking
> of doing it in a separate patch, but could tack it on to this
> series for completeness.
>

Oh, yeah, right. I see now what you mean. No problem on egress.

So it's just an SK_DROP return code from bpf_sk_redirect_map() that
could be a potential improvement.

Your call if you want to add it to this series. Patching it up as a
follow-up works for me as well.

>>
>> All in all, sounds like an API change when it comes to corner cases, in
>> addition to being a fix for the receive queue flush issue which you
>> explained in the patch description. If possible, would push it through
>> bpf-next.
>
> I think if we address 2,3,4 then we can fix the described issue
> without introducing new cases. And then 1 is great for consistency
> but can go via bpf-next?

So (3) is out, reuseport+sockmap users should be unaffected by this.

If you could patch (2) that would be great. We rely on this, and I can't
assume that nobody is disconnecting their listener sockets for some
reason.

(4) and (1) can follow later, if you ask me.

* Re: [PATCH bpf 1/4] bpf, sockmap: Remove unhash handler for BPF sockmap usage
  2021-10-20 16:35           ` Jakub Sitnicki
@ 2021-10-21 19:24             ` John Fastabend
  0 siblings, 0 replies; 14+ messages in thread
From: John Fastabend @ 2021-10-21 19:24 UTC (permalink / raw)
  To: Jakub Sitnicki, John Fastabend
  Cc: bpf, netdev, daniel, joamaki, xiyou.wangcong, Lorenz Bauer,
	Martin KaFai Lau

Jakub Sitnicki wrote:
> On Wed, Oct 20, 2021 at 05:51 PM CEST, John Fastabend wrote:
> > Jakub Sitnicki wrote:
> >> On Wed, Oct 20, 2021 at 07:28 AM CEST, John Fastabend wrote:
> >> > Jakub Sitnicki wrote:
> >> >> On Mon, Oct 11, 2021 at 09:16 PM CEST, John Fastabend wrote:
> >> >> > We do not need to handle unhash from BPF side we can simply wait for the
> >> >> > close to happen. The original concern was a socket could transition from
> >> >> > ESTABLISHED state to a new state while the BPF hook was still attached.
> >> >> > But, we convinced ourself this is no longer possible and we also
> >> >> > improved BPF sockmap to handle listen sockets so this is no longer a
> >> >> > problem.
> >> >> >
> >> >> > More importantly though there are cases where unhash is called when data is
> >> >> > in the receive queue. The BPF unhash logic will flush this data which is
> >> >> > wrong. To be correct it should keep the data in the receive queue and allow
> >> >> > a receiving application to continue reading the data. This may happen when
> >> >> > tcp_abort is received for example. Instead of complicating the logic in
> >> >> > unhash simply moving all this to tcp_close hook solves this.
> >> >> >
> >> >> > Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
> >> >> > Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> >> >> > ---
> >> >>
> >> >> Doesn't this open the possibility of having a TCP_CLOSE socket in
> >> >> sockmap if I disconnect it, that is call connect(AF_UNSPEC), instead of
> >> >> close it?
> >> >

[...]

> >> If we're considering allowing TCP sockets in TCP_CLOSE state in sockmap,
> >> a few things come to mind:
> >
> > I think what makes most sense is to do the minimal work to fix the
> > described issue for bpf tree without introducing new issues and
> > then do the consistency/better cases in bpf-next.
> >
> >>
> >> 1) We can't insert TCP_CLOSE sockets today. sock_map_sk_state_allowed()
> >>    won't allow it. However, with this change we will be able to have a
> >>    TCP_CLOSE socket in sockmap by disconnecting it. If so, perhaps
> >>    inserting TCP sockets in TCP_CLOSE state should be allowed for
> >>    consistency.
> >
> > I agree, but would hold off on this for bpf-next. I missed points
> > 2,3 though in this series.
> 
> OK, that makes sense.
> 
> >>
> >> 2) Checks in bpf_sk_lookup_assign() helper need adjusting. Only TCP
> >>    sockets in TCP_LISTEN state make a valid choice (and UDP sockets in
> >>    TCP_CLOSE state). Today we rely on the fact there that you can't
> >>    insert a TCP_CLOSE socket.
> >
> > This should be minimal change, just change the logic to allow only
> > TCP_LISTEN.
> >
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -10402,7 +10402,7 @@ BPF_CALL_3(bpf_sk_lookup_assign, struct bpf_sk_lookup_kern *, ctx,
> >                 return -EINVAL;
> >         if (unlikely(sk && sk_is_refcounted(sk)))
> >                 return -ESOCKTNOSUPPORT; /* reject non-RCU freed sockets */
> > -       if (unlikely(sk && sk->sk_state == TCP_ESTABLISHED))
> > +       if (unlikely(sk && sk->sk_state != TCP_LISTEN))
> >                 return -ESOCKTNOSUPPORT; /* reject connected sockets */
> >
> >         /* Check if socket is suitable for packet L3/L4 protocol */
> >
> >
> 
> Yeah, it shouldn't be hard. But we need to cover UDP as well. Something
> along the lines of:
> 
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -10402,8 +10402,10 @@ BPF_CALL_3(bpf_sk_lookup_assign, struct bpf_sk_lookup_kern *, ctx,
>                 return -EINVAL;
>         if (unlikely(sk && sk_is_refcounted(sk)))
>                 return -ESOCKTNOSUPPORT; /* reject non-RCU freed sockets */
> -       if (unlikely(sk && sk->sk_state == TCP_ESTABLISHED))
> -               return -ESOCKTNOSUPPORT; /* reject connected sockets */
> +       if (unlikely(sk && sk_is_tcp(sk) && sk->sk_state != TCP_LISTEN))
> +               return -ESOCKTNOSUPPORT; /* reject closed TCP sockets */
> +       if (unlikely(sk && sk_is_udp(sk) && sk->sk_state != TCP_CLOSE))
> +               return -ESOCKTNOSUPPORT; /* reject connected UDP sockets */
> 
>         /* Check if socket is suitable for packet L3/L4 protocol */
>         if (sk && sk->sk_protocol != ctx->protocol)
> 
> We aren't testing today that that error case in sk_lookup test suite,
> because it wasn't possible to insert a TCP_CLOSE socket. So once that
> gets in, I can add coverage.
> 
> >>
> >> 3) Checks in sk_select_reuseport() helper need adjusting as well. It's a
> >>    similar same case as with bpf_sk_lookup_assign() (with a slight
> >>    difference that reuseport allows dispatching to connected UDP
> >>    sockets).
> >
> > Is it needed here? There is no obvious check now.  Is ESTABLISHED
> > state OK here now?
> 
> TCP ESTABLISHED sockets are not okay. They can't join the reuseport
> group and will always hit the !reuse branch.
> 
> Re-reading the code, though, I think nothing needs to be done for the
> sk_select_reuseport() helper. TCP sockets will be detached from
> reuseport group on unhash. Hence TCP_CLOSE socket will also hit the
> !reuse branch.
> 
> CC'ing Martin just in case he wants to double-check.
> 
> >
> >>
> >> 4) Don't know exactly how checks in sockmap redirect helpers would need
> >>    to be tweaked. I recall that it can't be just TCP_ESTABLISHED state
> >>    that's allowed due to a short window of opportunity that opens up
> >>    when we transition from TCP_SYN_SENT to TCP_ESTABLISHED.
> >>    BPF_SOCK_OPS_STATE_CB callback happens just before the state is
> >>    switched to TCP_ESTABLISHED.
> >>
> >>    TCP_CLOSE socket sure doesn't make sense as a redirect target. Would
> >>    be nice to get an error from the redirect helper. If I understand
> >>    correctly, if the TCP stack drops the packet after BPF verdict has
> >>    selected a socket, only the socket owner will know about by reading
> >>    the error queue.
> >>
> >>    OTOH, redirecting to a TCP_CLOSE_WAIT socket doesn't make sense
> >>    either, but we don't seem to filter it out today, so the helper is
> >>    not airtight.
> >
> > Right. At the moment for sending we call do_tcp_sendpages() and this
> > has the normal check ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT) so we
> > would return an error. The missing case is ingress. We currently
> > let these happen and would need a check there. I was thinking
> > of doing it in a separate patch, but could tack it on to this
> > series for completeness.
> >
> 
> Oh, yeah, right. I see now what you mean. No problem on egress.
> 
> So it's just an SK_DROP return code from bpf_sk_redirect_map() that
> could be a potential improvement.
> 
> Your call if you want to add it this series. Patching it up as a follow
> up works for me as well.
> 
> >>
> >> All in all, sounds like an API change when it comes to corner cases, in
> >> addition to being a fix for the receive queue flush issue which you
> >> explained in the patch description. If possible, would push it through
> >> bpf-next.
> >
> > I think if we address 2,3,4 then we can fix the described issue
> > without introducing new cases. And then 1 is great for consistency
> > but can go via bpf-next?
> 
> So (3) is out, reuseport+sockmap users should be unaffected by this.
> 
> If you could patch (2) that would be great. We rely on this, and I can't
> assume that nobody isn't disconnecting their listener sockets for some
> reason.

Yep I'll roll a new version with a fix for this and leave the rest for
a follow up.

> 
> (4) and (1) can follow later, if you ask me.

Agree.

* Re: [PATCH bpf 4/4] bpf, sockmap: sk_skb data_end access incorrect when src_reg = dst_reg
  2021-10-11 19:16 ` [PATCH bpf 4/4] bpf, sockmap: sk_skb data_end access incorrect when src_reg = dst_reg John Fastabend
@ 2021-10-23 13:05   ` Jakub Sitnicki
  0 siblings, 0 replies; 14+ messages in thread
From: Jakub Sitnicki @ 2021-10-23 13:05 UTC (permalink / raw)
  To: John Fastabend; +Cc: bpf, netdev, daniel, joamaki, xiyou.wangcong

On Mon, Oct 11, 2021 at 09:16 PM CEST, John Fastabend wrote:
> From: Jussi Maki <joamaki@gmail.com>
>
> The current conversion of skb->data_end reads like this,
>
>   ; data_end = (void*)(long)skb->data_end;
>    559: (79) r1 = *(u64 *)(r2 +200)   ; r1  = skb->data
>    560: (61) r11 = *(u32 *)(r2 +112)  ; r11 = skb->len
>    561: (0f) r1 += r11
>    562: (61) r11 = *(u32 *)(r2 +116)
>    563: (1f) r1 -= r11
>
> But similar to the case
>
>  ("bpf: sock_ops sk access may stomp registers when dst_reg = src_reg"),
>
> the code will read an incorrect skb->len when src == dst. In this case we
> end up generating this xlated code.
>
>   ; data_end = (void*)(long)skb->data_end;
>    559: (79) r1 = *(u64 *)(r1 +200)   ; r1  = skb->data
>    560: (61) r11 = *(u32 *)(r1 +112)  ; r11 = (skb->data)->len
>    561: (0f) r1 += r11
>    562: (61) r11 = *(u32 *)(r1 +116)
>    563: (1f) r1 -= r11
>
> where line 560 is the reading 4B of (skb->data + 112) instead of the
> intended skb->len Here the skb pointer in r1 gets set to skb->data and
> the later deref for skb->len ends up following skb->data instead of skb.
>
> This fixes the issue similarly to the patch mentioned above by creating
> an additional temporary variable and using to store the register when
> dst_reg = src_reg. We name the variable bpf_temp_reg and place it in the
> cb context for sk_skb. Then we restore from the temp to ensure nothing
> is lost.
>
> Fixes: 16137b09a66f2 ("bpf: Compute data_end dynamically with JIT code")
> Signed-off-by: Jussi Maki <joamaki@gmail.com>
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---

Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
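
For the record, the clobber described in the quoted commit message can be
reproduced with a toy register machine in userspace. This is a sketch of
the idea only, not the kernel's instruction sequence: struct x_skb, the
x_* helpers and the choice of r9 as the scratch register are assumptions,
as is reading offset 116 as skb->data_len. The temp_reg member stands in
for the temporary slot the patch places in the sk_skb cb.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the sk_buff fields the conversion touches. */
struct x_skb {
	uintptr_t data;
	uint32_t len;
	uint32_t data_len;
	uintptr_t temp_reg;	/* spill slot, models the cb temp variable */
};

enum { X_NREGS = 12, X_AX = 11 };	/* regs[11] plays the role of r11 */

static uintptr_t x_ldp(uintptr_t base, size_t off)
{
	uintptr_t v;
	memcpy(&v, (const void *)(base + off), sizeof(v));
	return v;
}

static uint32_t x_ld32(uintptr_t base, size_t off)
{
	uint32_t v;
	memcpy(&v, (const void *)(base + off), sizeof(v));
	return v;
}

static void x_stp(uintptr_t base, size_t off, uintptr_t v)
{
	memcpy((void *)(base + off), &v, sizeof(v));
}

/* Naive sequence. If src == dst, the first load overwrites the skb
 * pointer and the later loads follow skb->data instead of skb, so the
 * caller must make sure that memory is readable. */
static uintptr_t x_data_end_naive(uintptr_t *regs, int dst, int src)
{
	assert(dst >= 0 && dst < X_AX && src >= 0 && src < X_AX);
	regs[dst] = x_ldp(regs[src], offsetof(struct x_skb, data));
	regs[X_AX] = x_ld32(regs[src], offsetof(struct x_skb, len));
	regs[dst] += regs[X_AX];
	regs[X_AX] = x_ld32(regs[src], offsetof(struct x_skb, data_len));
	regs[dst] -= regs[X_AX];
	return regs[dst];
}

/* Save/restore variant: when src == dst, compute in a scratch register
 * that is spilled to the temp slot first and restored afterwards. */
static uintptr_t x_data_end_fixed(uintptr_t *regs, int dst, int src)
{
	int tmp = dst;

	assert(dst >= 0 && dst < X_AX && src >= 0 && src < X_AX);
	if (src == dst) {
		tmp = 9;
		if (tmp == src)
			tmp--;
		x_stp(regs[src], offsetof(struct x_skb, temp_reg), regs[tmp]);
	}
	regs[tmp] = x_ldp(regs[src], offsetof(struct x_skb, data));
	regs[X_AX] = x_ld32(regs[src], offsetof(struct x_skb, len));
	regs[tmp] += regs[X_AX];
	regs[X_AX] = x_ld32(regs[src], offsetof(struct x_skb, data_len));
	regs[tmp] -= regs[X_AX];
	if (src == dst) {
		regs[X_AX] = regs[src];	/* keep the skb pointer alive */
		regs[dst] = regs[tmp];	/* publish the result */
		regs[tmp] = x_ldp(regs[X_AX],
				  offsetof(struct x_skb, temp_reg));
	}
	return regs[dst];
}
```

Pointing data at a zeroed buffer keeps the stray loads of the naive
variant in bounds, and it then returns data itself rather than
data + len - data_len.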
