From dd935f44a40f8fb02aff2cc0df2269c92422df1c Mon Sep 17 00:00:00 2001 From: Josh Durgin Date: Wed, 28 Aug 2013 21:43:09 -0700 Subject: libceph: add function to ensure notifies are complete Without a way to flush the osd client's notify workqueue, a watch event that is unregistered could continue receiving callbacks indefinitely. Unregistering the event simply means no new notifies are added to the queue, but there may still be events in the queue that will call the watch callback for the event. If the queue is flushed after the event is unregistered, the caller can be sure no more watch callbacks will occur for the canceled watch. Signed-off-by: Josh Durgin Reviewed-by: Sage Weil Reviewed-by: Alex Elder --- net/ceph/osd_client.c | 11 +++++++++++ 1 file changed, 11 insertions(+) (limited to 'net') diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c index 1606f740d6ae..2b4b32aaa893 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -2215,6 +2215,17 @@ void ceph_osdc_sync(struct ceph_osd_client *osdc) } EXPORT_SYMBOL(ceph_osdc_sync); +/* + * Call all pending notify callbacks - for use after a watch is + * unregistered, to make sure no more callbacks for it will be invoked + */ +extern void ceph_osdc_flush_notifies(struct ceph_osd_client *osdc) +{ + flush_workqueue(osdc->notify_wq); +} +EXPORT_SYMBOL(ceph_osdc_flush_notifies); + + /* * init, shutdown */ -- cgit v1.2.3-58-ga151 From b0dd663b60944a3ce86430fa35549fb37968bda0 Mon Sep 17 00:00:00 2001 From: Sonic Zhang Date: Wed, 11 Sep 2013 11:31:53 +0800 Subject: netpoll: Should handle ETH_P_ARP other than ETH_P_IP in netpoll_neigh_reply The received ARP request type in the Ethernet packet head is ETH_P_ARP other than ETH_P_IP. [ Bug introduced by commit b7394d2429c198b1da3d46ac39192e891029ec0f ("netpoll: prepare for ipv6") ] Signed-off-by: Sonic Zhang Signed-off-by: David S. Miller --- net/core/netpoll.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'net') diff --git a/net/core/netpoll.c b/net/core/netpoll.c index 2c637e9a0b27..c3c7b27c112d 100644 --- a/net/core/netpoll.c +++ b/net/core/netpoll.c @@ -550,7 +550,7 @@ static void netpoll_neigh_reply(struct sk_buff *skb, struct netpoll_info *npinfo return; proto = ntohs(eth_hdr(skb)->h_proto); - if (proto == ETH_P_IP) { + if (proto == ETH_P_ARP) { struct arphdr *arp; unsigned char *arp_ptr; /* No arp on this interface */ -- cgit v1.2.3-58-ga151 From 95ee62083cb6453e056562d91f597552021e6ae7 Mon Sep 17 00:00:00 2001 From: Daniel Borkmann Date: Wed, 11 Sep 2013 16:58:36 +0200 Subject: net: sctp: fix ipv6 ipsec encryption bug in sctp_v6_xmit Alan Chester reported an issue with IPv6 on SCTP that IPsec traffic is not being encrypted, whereas on IPv4 it is. Setting up an AH + ESP transport does not seem to have the desired effect: SCTP + IPv4: 22:14:20.809645 IP (tos 0x2,ECT(0), ttl 64, id 0, offset 0, flags [DF], proto AH (51), length 116) 192.168.0.2 > 192.168.0.5: AH(spi=0x00000042,sumlen=16,seq=0x1): ESP(spi=0x00000044,seq=0x1), length 72 22:14:20.813270 IP (tos 0x2,ECT(0), ttl 64, id 0, offset 0, flags [DF], proto AH (51), length 340) 192.168.0.5 > 192.168.0.2: AH(spi=0x00000043,sumlen=16,seq=0x1): SCTP + IPv6: 22:31:19.215029 IP6 (class 0x02, hlim 64, next-header SCTP (132) payload length: 364) fe80::222:15ff:fe87:7fc.3333 > fe80::92e6:baff:fe0d:5a54.36767: sctp 1) [INIT ACK] [init tag: 747759530] [rwnd: 62464] [OS: 10] [MIS: 10] Moreover, Alan says: This problem was seen with both Racoon and Racoon2. Other people have seen this with OpenSwan. When IPsec is configured to encrypt all upper layer protocols the SCTP connection does not initialize. After using Wireshark to follow packets, this is because the SCTP packet leaves Box A unencrypted and Box B believes all upper layer protocols are to be encrypted so it drops this packet, causing the SCTP connection to fail to initialize. When IPsec is configured to encrypt just SCTP, the SCTP packets are observed unencrypted. In fact, using `socat sctp6-listen:3333 -` on one end and transferring "plaintext" string on the other end, results in cleartext on the wire where SCTP eventually does not report any errors, thus in the latter case that Alan reports, the non-paranoid user might think he's communicating over an encrypted transport on SCTP although he's not (tcpdump ... -X): ... 0x0030: 5d70 8e1a 0003 001a 177d eb6c 0000 0000 ]p.......}.l.... 0x0040: 0000 0000 706c 6169 6e74 6578 740a 0000 ....plaintext... Only in /proc/net/xfrm_stat we can see XfrmInTmplMismatch increasing on the receiver side. Initial follow-up analysis from Alan's bug report was done by Alexey Dobriyan. Also thanks to Vlad Yasevich for feedback on this. SCTP has its own implementation of sctp_v6_xmit() not calling inet6_csk_xmit(). This has the implication that it probably never really got updated along with changes in inet6_csk_xmit() and therefore does not seem to invoke xfrm handlers. SCTP's IPv4 xmit however, properly calls ip_queue_xmit() to do the work. Since a call to inet6_csk_xmit() would solve this problem, but result in unecessary route lookups, let us just use the cached flowi6 instead that we got through sctp_v6_get_dst(). Since all SCTP packets are being sent through sctp_packet_transmit(), we do the route lookup / flow caching in sctp_transport_route(), hold it in tp->dst and skb_dst_set() right after that. If we would alter fl6->daddr in sctp_v6_xmit() to np->opt->srcrt, we possibly could run into the same effect of not having xfrm layer pick it up, hence, use fl6_update_dst() in sctp_v6_get_dst() instead to get the correct source routed dst entry, which we assign to the skb. Also source address routing example from 625034113 ("sctp: fix sctp to work with ipv6 source address routing") still works with this patch! Nevertheless, in RFC5095 it is actually 'recommended' to not use that anyway due to traffic amplification [1]. So it seems we're not supposed to do that anyway in sctp_v6_xmit(). Moreover, if we overwrite the flow destination here, the lower IPv6 layer will be unable to put the correct destination address into IP header, as routing header is added in ipv6_push_nfrag_opts() but then probably with wrong final destination. Things aside, result of this patch is that we do not have any XfrmInTmplMismatch increase plus on the wire with this patch it now looks like: SCTP + IPv6: 08:17:47.074080 IP6 2620:52:0:102f:7a2b:cbff:fe27:1b0a > 2620:52:0:102f:213:72ff:fe32:7eba: AH(spi=0x00005fb4,seq=0x1): ESP(spi=0x00005fb5,seq=0x1), length 72 08:17:47.074264 IP6 2620:52:0:102f:213:72ff:fe32:7eba > 2620:52:0:102f:7a2b:cbff:fe27:1b0a: AH(spi=0x00003d54,seq=0x1): ESP(spi=0x00003d55,seq=0x1), length 296 This fixes Kernel Bugzilla 24412. This security issue seems to be present since 2.6.18 kernels. Lets just hope some big passive adversary in the wild didn't have its fun with that. lksctp-tools IPv6 regression test suite passes as well with this patch. [1] http://www.secdev.org/conf/IPv6_RH_security-csw07.pdf Reported-by: Alan Chester Reported-by: Alexey Dobriyan Signed-off-by: Daniel Borkmann Cc: Steffen Klassert Cc: Hannes Frederic Sowa Acked-by: Vlad Yasevich Signed-off-by: David S. Miller --- net/sctp/ipv6.c | 42 +++++++++++++----------------------------- 1 file changed, 13 insertions(+), 29 deletions(-) (limited to 'net') diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c index da613ceae28c..4f52e2ce263d 100644 --- a/net/sctp/ipv6.c +++ b/net/sctp/ipv6.c @@ -204,44 +204,23 @@ out: in6_dev_put(idev); } -/* Based on tcp_v6_xmit() in tcp_ipv6.c. */ static int sctp_v6_xmit(struct sk_buff *skb, struct sctp_transport *transport) { struct sock *sk = skb->sk; struct ipv6_pinfo *np = inet6_sk(sk); - struct flowi6 fl6; - - memset(&fl6, 0, sizeof(fl6)); - - fl6.flowi6_proto = sk->sk_protocol; - - /* Fill in the dest address from the route entry passed with the skb - * and the source address from the transport. - */ - fl6.daddr = transport->ipaddr.v6.sin6_addr; - fl6.saddr = transport->saddr.v6.sin6_addr; - - fl6.flowlabel = np->flow_label; - IP6_ECN_flow_xmit(sk, fl6.flowlabel); - if (ipv6_addr_type(&fl6.saddr) & IPV6_ADDR_LINKLOCAL) - fl6.flowi6_oif = transport->saddr.v6.sin6_scope_id; - else - fl6.flowi6_oif = sk->sk_bound_dev_if; - - if (np->opt && np->opt->srcrt) { - struct rt0_hdr *rt0 = (struct rt0_hdr *) np->opt->srcrt; - fl6.daddr = *rt0->addr; - } + struct flowi6 *fl6 = &transport->fl.u.ip6; pr_debug("%s: skb:%p, len:%d, src:%pI6 dst:%pI6\n", __func__, skb, - skb->len, &fl6.saddr, &fl6.daddr); + skb->len, &fl6->saddr, &fl6->daddr); - SCTP_INC_STATS(sock_net(sk), SCTP_MIB_OUTSCTPPACKS); + IP6_ECN_flow_xmit(sk, fl6->flowlabel); if (!(transport->param_flags & SPP_PMTUD_ENABLE)) skb->local_df = 1; - return ip6_xmit(sk, skb, &fl6, np->opt, np->tclass); + SCTP_INC_STATS(sock_net(sk), SCTP_MIB_OUTSCTPPACKS); + + return ip6_xmit(sk, skb, fl6, np->opt, np->tclass); } /* Returns the dst cache entry for the given source and destination ip @@ -254,10 +233,12 @@ static void sctp_v6_get_dst(struct sctp_transport *t, union sctp_addr *saddr, struct dst_entry *dst = NULL; struct flowi6 *fl6 = &fl->u.ip6; struct sctp_bind_addr *bp; + struct ipv6_pinfo *np = inet6_sk(sk); struct sctp_sockaddr_entry *laddr; union sctp_addr *baddr = NULL; union sctp_addr *daddr = &t->ipaddr; union sctp_addr dst_saddr; + struct in6_addr *final_p, final; __u8 matchlen = 0; __u8 bmatchlen; sctp_scope_t scope; @@ -281,7 +262,8 @@ static void sctp_v6_get_dst(struct sctp_transport *t, union sctp_addr *saddr, pr_debug("src=%pI6 - ", &fl6->saddr); } - dst = ip6_dst_lookup_flow(sk, fl6, NULL, false); + final_p = fl6_update_dst(fl6, np->opt, &final); + dst = ip6_dst_lookup_flow(sk, fl6, final_p, false); if (!asoc || saddr) goto out; @@ -333,10 +315,12 @@ static void sctp_v6_get_dst(struct sctp_transport *t, union sctp_addr *saddr, } } rcu_read_unlock(); + if (baddr) { fl6->saddr = baddr->v6.sin6_addr; fl6->fl6_sport = baddr->v6.sin6_port; - dst = ip6_dst_lookup_flow(sk, fl6, NULL, false); + final_p = fl6_update_dst(fl6, np->opt, &final); + dst = ip6_dst_lookup_flow(sk, fl6, final_p, false); } out: -- cgit v1.2.3-58-ga151 From 9a0620133ccce9dd35c00a96405c8d80938c2cc0 Mon Sep 17 00:00:00 2001 From: Chris Healy Date: Wed, 11 Sep 2013 21:37:47 -0700 Subject: resubmit bridge: fix message_age_timer calculation This changes the message_age_timer calculation to use the BPDU's max age as opposed to the local bridge's max age. This is in accordance with section 8.6.2.3.2 Step 2 of the 802.1D-1998 sprecification. With the current implementation, when running with very large bridge diameters, convergance will not always occur even if a root bridge is configured to have a longer max age. Tested successfully on bridge diameters of ~200. Signed-off-by: Chris Healy Signed-off-by: David S. Miller --- net/bridge/br_stp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'net') diff --git a/net/bridge/br_stp.c b/net/bridge/br_stp.c index 1c0a50f13229..f1887ba7fc43 100644 --- a/net/bridge/br_stp.c +++ b/net/bridge/br_stp.c @@ -209,7 +209,7 @@ static void br_record_config_information(struct net_bridge_port *p, p->designated_age = jiffies - bpdu->message_age; mod_timer(&p->message_age_timer, jiffies - + (p->br->max_age - bpdu->message_age)); + + (bpdu->max_age - bpdu->message_age)); } /* called under bridge lock */ -- cgit v1.2.3-58-ga151 From be4f154d5ef0ca147ab6bcd38857a774133f5450 Mon Sep 17 00:00:00 2001 From: Herbert Xu Date: Thu, 12 Sep 2013 17:12:05 +1000 Subject: bridge: Clamp forward_delay when enabling STP At some point limits were added to forward_delay. However, the limits are only enforced when STP is enabled. This created a scenario where you could have a value outside the allowed range while STP is disabled, which then stuck around even after STP is enabled. This patch fixes this by clamping the value when we enable STP. I had to move the locking around a bit to ensure that there is no window where someone could insert a value outside the range while we're in the middle of enabling STP. Signed-off-by: Herbert Xu Cheers, Signed-off-by: David S. Miller --- net/bridge/br_private.h | 1 + net/bridge/br_stp.c | 21 +++++++++++++++------ net/bridge/br_stp_if.c | 12 ++++++++++-- 3 files changed, 26 insertions(+), 8 deletions(-) (limited to 'net') diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h index 598cb0b333c6..cda83158a21c 100644 --- a/net/bridge/br_private.h +++ b/net/bridge/br_private.h @@ -746,6 +746,7 @@ extern struct net_bridge_port *br_get_port(struct net_bridge *br, extern void br_init_port(struct net_bridge_port *p); extern void br_become_designated_port(struct net_bridge_port *p); +extern void __br_set_forward_delay(struct net_bridge *br, unsigned long t); extern int br_set_forward_delay(struct net_bridge *br, unsigned long x); extern int br_set_hello_time(struct net_bridge *br, unsigned long x); extern int br_set_max_age(struct net_bridge *br, unsigned long x); diff --git a/net/bridge/br_stp.c b/net/bridge/br_stp.c index f1887ba7fc43..3c86f0538cbb 100644 --- a/net/bridge/br_stp.c +++ b/net/bridge/br_stp.c @@ -544,18 +544,27 @@ int br_set_max_age(struct net_bridge *br, unsigned long val) } +void __br_set_forward_delay(struct net_bridge *br, unsigned long t) +{ + br->bridge_forward_delay = t; + if (br_is_root_bridge(br)) + br->forward_delay = br->bridge_forward_delay; +} + int br_set_forward_delay(struct net_bridge *br, unsigned long val) { unsigned long t = clock_t_to_jiffies(val); + int err = -ERANGE; + spin_lock_bh(&br->lock); if (br->stp_enabled != BR_NO_STP && (t < BR_MIN_FORWARD_DELAY || t > BR_MAX_FORWARD_DELAY)) - return -ERANGE; + goto unlock; - spin_lock_bh(&br->lock); - br->bridge_forward_delay = t; - if (br_is_root_bridge(br)) - br->forward_delay = br->bridge_forward_delay; + __br_set_forward_delay(br, t); + err = 0; + +unlock: spin_unlock_bh(&br->lock); - return 0; + return err; } diff --git a/net/bridge/br_stp_if.c b/net/bridge/br_stp_if.c index d45e760141bb..108084a04671 100644 --- a/net/bridge/br_stp_if.c +++ b/net/bridge/br_stp_if.c @@ -129,6 +129,14 @@ static void br_stp_start(struct net_bridge *br) char *envp[] = { NULL }; r = call_usermodehelper(BR_STP_PROG, argv, envp, UMH_WAIT_PROC); + + spin_lock_bh(&br->lock); + + if (br->bridge_forward_delay < BR_MIN_FORWARD_DELAY) + __br_set_forward_delay(br, BR_MIN_FORWARD_DELAY); + else if (br->bridge_forward_delay < BR_MAX_FORWARD_DELAY) + __br_set_forward_delay(br, BR_MAX_FORWARD_DELAY); + if (r == 0) { br->stp_enabled = BR_USER_STP; br_debug(br, "userspace STP started\n"); @@ -137,10 +145,10 @@ static void br_stp_start(struct net_bridge *br) br_debug(br, "using kernel STP\n"); /* To start timers on any ports left in blocking */ - spin_lock_bh(&br->lock); br_port_state_selection(br); - spin_unlock_bh(&br->lock); } + + spin_unlock_bh(&br->lock); } static void br_stp_stop(struct net_bridge *br) -- cgit v1.2.3-58-ga151 From d830f0fa1dd7ca447c38aec82cd44230e0b7ca75 Mon Sep 17 00:00:00 2001 From: Phil Oester Date: Thu, 12 Sep 2013 18:04:16 -0700 Subject: netfilter: nf_nat_proto_icmpv6:: fix wrong comparison in icmpv6_manip_pkt In commit 58a317f1 (netfilter: ipv6: add IPv6 NAT support), icmpv6_manip_pkt was added with an incorrect comparison of ICMP codes to types. This causes problems when using NAT rules with the --random option. Correct the comparison. This closes netfilter bugzilla #851, reported by Alexander Neumann. Signed-off-by: Phil Oester Signed-off-by: Pablo Neira Ayuso --- net/ipv6/netfilter/nf_nat_proto_icmpv6.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'net') diff --git a/net/ipv6/netfilter/nf_nat_proto_icmpv6.c b/net/ipv6/netfilter/nf_nat_proto_icmpv6.c index 61aaf70f376e..2205e8eeeacf 100644 --- a/net/ipv6/netfilter/nf_nat_proto_icmpv6.c +++ b/net/ipv6/netfilter/nf_nat_proto_icmpv6.c @@ -69,8 +69,8 @@ icmpv6_manip_pkt(struct sk_buff *skb, hdr = (struct icmp6hdr *)(skb->data + hdroff); l3proto->csum_update(skb, iphdroff, &hdr->icmp6_cksum, tuple, maniptype); - if (hdr->icmp6_code == ICMPV6_ECHO_REQUEST || - hdr->icmp6_code == ICMPV6_ECHO_REPLY) { + if (hdr->icmp6_type == ICMPV6_ECHO_REQUEST || + hdr->icmp6_type == ICMPV6_ECHO_REPLY) { inet_proto_csum_replace2(&hdr->icmp6_cksum, skb, hdr->icmp6_identifier, tuple->src.u.icmp.id, 0); -- cgit v1.2.3-58-ga151 From 1fb1754a8c70d69ab480763c423e0a74369c4a67 Mon Sep 17 00:00:00 2001 From: Hong Zhiguo Date: Sat, 14 Sep 2013 22:42:27 +0800 Subject: bridge: use br_port_get_rtnl within rtnl lock current br_port_get_rcu is problematic in bridging path (NULL deref). Change these calls in netlink path first. Signed-off-by: Hong Zhiguo Acked-by: Eric Dumazet Signed-off-by: David S. Miller --- net/bridge/br_netlink.c | 4 ++-- net/bridge/br_private.h | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) (limited to 'net') diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c index b9259efa636e..e74ddc1c29a8 100644 --- a/net/bridge/br_netlink.c +++ b/net/bridge/br_netlink.c @@ -207,7 +207,7 @@ int br_getlink(struct sk_buff *skb, u32 pid, u32 seq, struct net_device *dev, u32 filter_mask) { int err = 0; - struct net_bridge_port *port = br_port_get_rcu(dev); + struct net_bridge_port *port = br_port_get_rtnl(dev); /* not a bridge port and */ if (!port && !(filter_mask & RTEXT_FILTER_BRVLAN)) @@ -451,7 +451,7 @@ static size_t br_get_link_af_size(const struct net_device *dev) struct net_port_vlans *pv; if (br_port_exists(dev)) - pv = nbp_get_vlan_info(br_port_get_rcu(dev)); + pv = nbp_get_vlan_info(br_port_get_rtnl(dev)); else if (dev->priv_flags & IFF_EBRIDGE) pv = br_get_vlan_info((struct net_bridge *)netdev_priv(dev)); else diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h index cda83158a21c..dd583177cba4 100644 --- a/net/bridge/br_private.h +++ b/net/bridge/br_private.h @@ -208,7 +208,7 @@ static inline struct net_bridge_port *br_port_get_rcu(const struct net_device *d return br_port_exists(dev) ? port : NULL; } -static inline struct net_bridge_port *br_port_get_rtnl(struct net_device *dev) +static inline struct net_bridge_port *br_port_get_rtnl(const struct net_device *dev) { return br_port_exists(dev) ? rtnl_dereference(dev->rx_handler_data) : NULL; -- cgit v1.2.3-58-ga151 From 716ec052d2280d511e10e90ad54a86f5b5d4dcc2 Mon Sep 17 00:00:00 2001 From: Hong Zhiguo Date: Sat, 14 Sep 2013 22:42:28 +0800 Subject: bridge: fix NULL pointer deref of br_port_get_rcu The NULL deref happens when br_handle_frame is called between these 2 lines of del_nbp: dev->priv_flags &= ~IFF_BRIDGE_PORT; /* --> br_handle_frame is called at this time */ netdev_rx_handler_unregister(dev); In br_handle_frame the return of br_port_get_rcu(dev) is dereferenced without check but br_port_get_rcu(dev) returns NULL if: !(dev->priv_flags & IFF_BRIDGE_PORT) Eric Dumazet pointed out the testing of IFF_BRIDGE_PORT is not necessary here since we're in rcu_read_lock and we have synchronize_net() in netdev_rx_handler_unregister. So remove the testing of IFF_BRIDGE_PORT and by the previous patch, make sure br_port_get_rcu is called in bridging code. Signed-off-by: Hong Zhiguo Acked-by: Eric Dumazet Signed-off-by: David S. Miller --- net/bridge/br_private.h | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) (limited to 'net') diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h index dd583177cba4..efb57d911569 100644 --- a/net/bridge/br_private.h +++ b/net/bridge/br_private.h @@ -202,10 +202,7 @@ struct net_bridge_port static inline struct net_bridge_port *br_port_get_rcu(const struct net_device *dev) { - struct net_bridge_port *port = - rcu_dereference_rtnl(dev->rx_handler_data); - - return br_port_exists(dev) ? port : NULL; + return rcu_dereference(dev->rx_handler_data); } static inline struct net_bridge_port *br_port_get_rtnl(const struct net_device *dev) -- cgit v1.2.3-58-ga151 From f8776218e8546397be64ad2bc0ebf4748522d6e3 Mon Sep 17 00:00:00 2001 From: Andre Guedes Date: Wed, 31 Jul 2013 16:25:28 -0300 Subject: Bluetooth: Fix security level for peripheral role While playing the peripheral role, the host gets a LE Long Term Key Request Event from the controller when a connection is established with a bonded device. The host then informs the LTK which should be used for the connection. Once the link is encrypted, the host gets an Encryption Change Event. Therefore we should set conn->pending_sec_level instead of conn-> sec_level in hci_le_ltk_request_evt. This way, conn->sec_level is properly updated in hci_encrypt_change_evt. Moreover, since we have a LTK associated to the device, we have at least BT_SECURITY_MEDIUM security level. Cc: Stable Signed-off-by: Andre Guedes Signed-off-by: Gustavo Padovan --- net/bluetooth/hci_event.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) (limited to 'net') diff --git a/net/bluetooth/hci_event.c b/net/bluetooth/hci_event.c index 94aab73f89d4..04967248f899 100644 --- a/net/bluetooth/hci_event.c +++ b/net/bluetooth/hci_event.c @@ -3557,7 +3557,9 @@ static void hci_le_ltk_request_evt(struct hci_dev *hdev, struct sk_buff *skb) cp.handle = cpu_to_le16(conn->handle); if (ltk->authenticated) - conn->sec_level = BT_SECURITY_HIGH; + conn->pending_sec_level = BT_SECURITY_HIGH; + else + conn->pending_sec_level = BT_SECURITY_MEDIUM; hci_send_cmd(hdev, HCI_OP_LE_LTK_REPLY, sizeof(cp), &cp); -- cgit v1.2.3-58-ga151 From 89cbb4da0abee2f39d75f67f9fd57f7410c8b65c Mon Sep 17 00:00:00 2001 From: Andre Guedes Date: Wed, 31 Jul 2013 16:25:29 -0300 Subject: Bluetooth: Fix encryption key size for peripheral role This patch fixes the connection encryption key size information when the host is playing the peripheral role. We should set conn->enc_key_ size in hci_le_ltk_request_evt, otherwise it is left uninitialized. Cc: Stable Signed-off-by: Andre Guedes Signed-off-by: Gustavo Padovan --- net/bluetooth/hci_event.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'net') diff --git a/net/bluetooth/hci_event.c b/net/bluetooth/hci_event.c index 04967248f899..8db3e89fae35 100644 --- a/net/bluetooth/hci_event.c +++ b/net/bluetooth/hci_event.c @@ -3561,6 +3561,8 @@ static void hci_le_ltk_request_evt(struct hci_dev *hdev, struct sk_buff *skb) else conn->pending_sec_level = BT_SECURITY_MEDIUM; + conn->enc_key_size = ltk->enc_size; + hci_send_cmd(hdev, HCI_OP_LE_LTK_REPLY, sizeof(cp), &cp); if (ltk->type & HCI_SMP_STK) { -- cgit v1.2.3-58-ga151 From 330b6c1521d76d8b88513fbafe18709ad86dafc4 Mon Sep 17 00:00:00 2001 From: Syam Sidhardhan Date: Tue, 6 Aug 2013 01:59:12 +0900 Subject: Bluetooth: Fix ACL alive for long in case of non pariable devices For certain devices (ex: HID mouse), support for authentication, pairing and bonding is optional. For such devices, the ACL alive for too long after the L2CAP disconnection. To avoid the ACL alive for too long after L2CAP disconnection, reset the ACL disconnect timeout back to HCI_DISCONN_TIMEOUT during L2CAP connect. While merging the commit id:a9ea3ed9b71cc3271dd59e76f65748adcaa76422 this issue might have introduced. Hcidump info: sh-4.1# /opt/hcidump -Xt 2013-08-05 16:49:00.894129 < ACL data: handle 12 flags 0x00 dlen 12 L2CAP(s): Disconn req: dcid 0x004a scid 0x0041 2013-08-05 16:49:00.894195 < HCI Command: Exit Sniff Mode (0x02|0x0004) plen 2 handle 12 2013-08-05 16:49:00.894269 < ACL data: handle 12 flags 0x00 dlen 12 L2CAP(s): Disconn req: dcid 0x0049 scid 0x0040 2013-08-05 16:49:00.895645 > HCI Event: Command Status (0x0f) plen 4 Exit Sniff Mode (0x02|0x0004) status 0x00 ncmd 1 2013-08-05 16:49:00.934391 > HCI Event: Mode Change (0x14) plen 6 status 0x00 handle 12 mode 0x00 interval 0 Mode: Active 2013-08-05 16:49:00.936592 > HCI Event: Number of Completed Packets (0x13) plen 5 handle 12 packets 2 2013-08-05 16:49:00.951577 > ACL data: handle 12 flags 0x02 dlen 12 L2CAP(s): Disconn rsp: dcid 0x004a scid 0x0041 2013-08-05 16:49:00.952820 > ACL data: handle 12 flags 0x02 dlen 12 L2CAP(s): Disconn rsp: dcid 0x0049 scid 0x0040 2013-08-05 16:49:00.969165 > HCI Event: Mode Change (0x14) plen 6 status 0x00 handle 12 mode 0x02 interval 50 Mode: Sniff 2013-08-05 16:49:48.175533 > HCI Event: Mode Change (0x14) plen 6 status 0x00 handle 12 mode 0x00 interval 0 Mode: Active 2013-08-05 16:49:48.219045 > HCI Event: Mode Change (0x14) plen 6 status 0x00 handle 12 mode 0x02 interval 108 Mode: Sniff 2013-08-05 16:51:00.968209 < HCI Command: Disconnect (0x01|0x0006) plen 3 handle 12 reason 0x13 Reason: Remote User Terminated Connection 2013-08-05 16:51:00.969056 > HCI Event: Command Status (0x0f) plen 4 Disconnect (0x01|0x0006) status 0x00 ncmd 1 2013-08-05 16:51:01.013495 > HCI Event: Mode Change (0x14) plen 6 status 0x00 handle 12 mode 0x00 interval 0 Mode: Active 2013-08-05 16:51:01.073777 > HCI Event: Disconn Complete (0x05) plen 4 status 0x00 handle 12 reason 0x16 Reason: Connection Terminated by Local Host ============================ After fix ================================ 2013-08-05 16:57:35.986648 < ACL data: handle 11 flags 0x00 dlen 12 L2CAP(s): Disconn req: dcid 0x004c scid 0x0041 2013-08-05 16:57:35.986713 < HCI Command: Exit Sniff Mode (0x02|0x0004) plen 2 handle 11 2013-08-05 16:57:35.986785 < ACL data: handle 11 flags 0x00 dlen 12 L2CAP(s): Disconn req: dcid 0x004b scid 0x0040 2013-08-05 16:57:35.988110 > HCI Event: Command Status (0x0f) plen 4 Exit Sniff Mode (0x02|0x0004) status 0x00 ncmd 1 2013-08-05 16:57:36.030714 > HCI Event: Mode Change (0x14) plen 6 status 0x00 handle 11 mode 0x00 interval 0 Mode: Active 2013-08-05 16:57:36.032950 > HCI Event: Number of Completed Packets (0x13) plen 5 handle 11 packets 2 2013-08-05 16:57:36.047926 > ACL data: handle 11 flags 0x02 dlen 12 L2CAP(s): Disconn rsp: dcid 0x004c scid 0x0041 2013-08-05 16:57:36.049200 > ACL data: handle 11 flags 0x02 dlen 12 L2CAP(s): Disconn rsp: dcid 0x004b scid 0x0040 2013-08-05 16:57:36.065509 > HCI Event: Mode Change (0x14) plen 6 status 0x00 handle 11 mode 0x02 interval 50 Mode: Sniff 2013-08-05 16:57:40.052006 < HCI Command: Disconnect (0x01|0x0006) plen 3 handle 11 reason 0x13 Reason: Remote User Terminated Connection 2013-08-05 16:57:40.052869 > HCI Event: Command Status (0x0f) plen 4 Disconnect (0x01|0x0006) status 0x00 ncmd 1 2013-08-05 16:57:40.104731 > HCI Event: Mode Change (0x14) plen 6 status 0x00 handle 11 mode 0x00 interval 0 Mode: Active 2013-08-05 16:57:40.146935 > HCI Event: Disconn Complete (0x05) plen 4 status 0x00 handle 11 reason 0x16 Reason: Connection Terminated by Local Host Signed-off-by: Sang-Ki Park Signed-off-by: Chan-yeol Park Signed-off-by: Jaganath Kanakkassery Signed-off-by: Szymon Janc Signed-off-by: Syam Sidhardhan Signed-off-by: Gustavo Padovan --- net/bluetooth/l2cap_core.c | 7 +++++++ 1 file changed, 7 insertions(+) (limited to 'net') diff --git a/net/bluetooth/l2cap_core.c b/net/bluetooth/l2cap_core.c index b3bb7bca8e60..63fa11109a1c 100644 --- a/net/bluetooth/l2cap_core.c +++ b/net/bluetooth/l2cap_core.c @@ -3755,6 +3755,13 @@ static struct l2cap_chan *l2cap_connect(struct l2cap_conn *conn, sk = chan->sk; + /* For certain devices (ex: HID mouse), support for authentication, + * pairing and bonding is optional. For such devices, inorder to avoid + * the ACL alive for too long after L2CAP disconnection, reset the ACL + * disc_timeout back to HCI_DISCONN_TIMEOUT during L2CAP connect. + */ + conn->hcon->disc_timeout = HCI_DISCONN_TIMEOUT; + bacpy(&bt_sk(sk)->src, conn->src); bacpy(&bt_sk(sk)->dst, conn->dst); chan->psm = psm; -- cgit v1.2.3-58-ga151 From 55524c219aa803887d1c247853842a9566598cba Mon Sep 17 00:00:00 2001 From: Jozsef Kadlecsik Date: Mon, 16 Sep 2013 20:00:08 +0200 Subject: netfilter: ipset: Skip really non-first fragments for IPv6 when getting port/protocol Signed-off-by: Jozsef Kadlecsik --- net/netfilter/ipset/ip_set_getport.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'net') diff --git a/net/netfilter/ipset/ip_set_getport.c b/net/netfilter/ipset/ip_set_getport.c index 6fdf88ae2353..dac156f819ac 100644 --- a/net/netfilter/ipset/ip_set_getport.c +++ b/net/netfilter/ipset/ip_set_getport.c @@ -116,12 +116,12 @@ ip_set_get_ip6_port(const struct sk_buff *skb, bool src, { int protoff; u8 nexthdr; - __be16 frag_off; + __be16 frag_off = 0; nexthdr = ipv6_hdr(skb)->nexthdr; protoff = ipv6_skip_exthdr(skb, sizeof(struct ipv6hdr), &nexthdr, &frag_off); - if (protoff < 0) + if (protoff < 0 || (frag_off & htons(~0x7)) != 0) return false; return get_port(skb, nexthdr, protoff, src, port, proto); -- cgit v1.2.3-58-ga151 From 0f1799ba1a5db4c48b72ac2da2dc70d8c190a73d Mon Sep 17 00:00:00 2001 From: Jozsef Kadlecsik Date: Mon, 16 Sep 2013 20:04:53 +0200 Subject: netfilter: ipset: Consistent userspace testing with nomatch flag The "nomatch" commandline flag should invert the matching at testing, similarly to the --return-nomatch flag of the "set" match of iptables. Until now it worked with the elements with "nomatch" flag only. From now on it works with elements without the flag too, i.e: # ipset n test hash:net # ipset a test 10.0.0.0/24 nomatch # ipset t test 10.0.0.1 10.0.0.1 is NOT in set test. # ipset t test 10.0.0.1 nomatch 10.0.0.1 is in set test. # ipset a test 192.168.0.0/24 # ipset t test 192.168.0.1 192.168.0.1 is in set test. # ipset t test 192.168.0.1 nomatch 192.168.0.1 is NOT in set test. Before the patch the results were ... # ipset t test 192.168.0.1 192.168.0.1 is in set test. # ipset t test 192.168.0.1 nomatch 192.168.0.1 is in set test. Signed-off-by: Jozsef Kadlecsik --- include/linux/netfilter/ipset/ip_set.h | 6 ++++-- net/netfilter/ipset/ip_set_core.c | 3 +-- net/netfilter/ipset/ip_set_hash_ipportnet.c | 4 ++-- net/netfilter/ipset/ip_set_hash_net.c | 4 ++-- net/netfilter/ipset/ip_set_hash_netiface.c | 4 ++-- net/netfilter/ipset/ip_set_hash_netport.c | 4 ++-- 6 files changed, 13 insertions(+), 12 deletions(-) (limited to 'net') diff --git a/include/linux/netfilter/ipset/ip_set.h b/include/linux/netfilter/ipset/ip_set.h index d80e2753847c..9ac9fbde7b61 100644 --- a/include/linux/netfilter/ipset/ip_set.h +++ b/include/linux/netfilter/ipset/ip_set.h @@ -296,10 +296,12 @@ ip_set_eexist(int ret, u32 flags) /* Match elements marked with nomatch */ static inline bool -ip_set_enomatch(int ret, u32 flags, enum ipset_adt adt) +ip_set_enomatch(int ret, u32 flags, enum ipset_adt adt, struct ip_set *set) { return adt == IPSET_TEST && - ret == -ENOTEMPTY && ((flags >> 16) & IPSET_FLAG_NOMATCH); + (set->type->features & IPSET_TYPE_NOMATCH) && + ((flags >> 16) & IPSET_FLAG_NOMATCH) && + (ret > 0 || ret == -ENOTEMPTY); } /* Check the NLA_F_NET_BYTEORDER flag */ diff --git a/net/netfilter/ipset/ip_set_core.c b/net/netfilter/ipset/ip_set_core.c index f77139007983..c8c303c3386f 100644 --- a/net/netfilter/ipset/ip_set_core.c +++ b/net/netfilter/ipset/ip_set_core.c @@ -1489,8 +1489,7 @@ ip_set_utest(struct sock *ctnl, struct sk_buff *skb, if (ret == -EAGAIN) ret = 1; - return (ret < 0 && ret != -ENOTEMPTY) ? ret : - ret > 0 ? 0 : -IPSET_ERR_EXIST; + return ret > 0 ? 0 : -IPSET_ERR_EXIST; } /* Get headed data of a set */ diff --git a/net/netfilter/ipset/ip_set_hash_ipportnet.c b/net/netfilter/ipset/ip_set_hash_ipportnet.c index c6a525373be4..f15f3e28b9c3 100644 --- a/net/netfilter/ipset/ip_set_hash_ipportnet.c +++ b/net/netfilter/ipset/ip_set_hash_ipportnet.c @@ -260,7 +260,7 @@ hash_ipportnet4_uadt(struct ip_set *set, struct nlattr *tb[], e.ip = htonl(ip); e.ip2 = htonl(ip2_from & ip_set_hostmask(e.cidr + 1)); ret = adtfn(set, &e, &ext, &ext, flags); - return ip_set_enomatch(ret, flags, adt) ? 1 : + return ip_set_enomatch(ret, flags, adt, set) ? -ret : ip_set_eexist(ret, flags) ? 0 : ret; } @@ -544,7 +544,7 @@ hash_ipportnet6_uadt(struct ip_set *set, struct nlattr *tb[], if (adt == IPSET_TEST || !with_ports || !tb[IPSET_ATTR_PORT_TO]) { ret = adtfn(set, &e, &ext, &ext, flags); - return ip_set_enomatch(ret, flags, adt) ? 1 : + return ip_set_enomatch(ret, flags, adt, set) ? -ret : ip_set_eexist(ret, flags) ? 0 : ret; } diff --git a/net/netfilter/ipset/ip_set_hash_net.c b/net/netfilter/ipset/ip_set_hash_net.c index da740ceb56ae..223e9f546d0f 100644 --- a/net/netfilter/ipset/ip_set_hash_net.c +++ b/net/netfilter/ipset/ip_set_hash_net.c @@ -199,7 +199,7 @@ hash_net4_uadt(struct ip_set *set, struct nlattr *tb[], if (adt == IPSET_TEST || !tb[IPSET_ATTR_IP_TO]) { e.ip = htonl(ip & ip_set_hostmask(e.cidr)); ret = adtfn(set, &e, &ext, &ext, flags); - return ip_set_enomatch(ret, flags, adt) ? 1 : + return ip_set_enomatch(ret, flags, adt, set) ? -ret: ip_set_eexist(ret, flags) ? 0 : ret; } @@ -396,7 +396,7 @@ hash_net6_uadt(struct ip_set *set, struct nlattr *tb[], ret = adtfn(set, &e, &ext, &ext, flags); - return ip_set_enomatch(ret, flags, adt) ? 1 : + return ip_set_enomatch(ret, flags, adt, set) ? -ret : ip_set_eexist(ret, flags) ? 0 : ret; } diff --git a/net/netfilter/ipset/ip_set_hash_netiface.c b/net/netfilter/ipset/ip_set_hash_netiface.c index 84ae6f6ce624..7d798d5d5cd3 100644 --- a/net/netfilter/ipset/ip_set_hash_netiface.c +++ b/net/netfilter/ipset/ip_set_hash_netiface.c @@ -368,7 +368,7 @@ hash_netiface4_uadt(struct ip_set *set, struct nlattr *tb[], if (adt == IPSET_TEST || !tb[IPSET_ATTR_IP_TO]) { e.ip = htonl(ip & ip_set_hostmask(e.cidr)); ret = adtfn(set, &e, &ext, &ext, flags); - return ip_set_enomatch(ret, flags, adt) ? 1 : + return ip_set_enomatch(ret, flags, adt, set) ? -ret : ip_set_eexist(ret, flags) ? 0 : ret; } @@ -634,7 +634,7 @@ hash_netiface6_uadt(struct ip_set *set, struct nlattr *tb[], ret = adtfn(set, &e, &ext, &ext, flags); - return ip_set_enomatch(ret, flags, adt) ? 1 : + return ip_set_enomatch(ret, flags, adt, set) ? -ret : ip_set_eexist(ret, flags) ? 0 : ret; } diff --git a/net/netfilter/ipset/ip_set_hash_netport.c b/net/netfilter/ipset/ip_set_hash_netport.c index 9a0869853be5..09d6690bee6f 100644 --- a/net/netfilter/ipset/ip_set_hash_netport.c +++ b/net/netfilter/ipset/ip_set_hash_netport.c @@ -244,7 +244,7 @@ hash_netport4_uadt(struct ip_set *set, struct nlattr *tb[], if (adt == IPSET_TEST || !(with_ports || tb[IPSET_ATTR_IP_TO])) { e.ip = htonl(ip & ip_set_hostmask(e.cidr + 1)); ret = adtfn(set, &e, &ext, &ext, flags); - return ip_set_enomatch(ret, flags, adt) ? 1 : + return ip_set_enomatch(ret, flags, adt, set) ? -ret : ip_set_eexist(ret, flags) ? 0 : ret; } @@ -489,7 +489,7 @@ hash_netport6_uadt(struct ip_set *set, struct nlattr *tb[], if (adt == IPSET_TEST || !with_ports || !tb[IPSET_ATTR_PORT_TO]) { ret = adtfn(set, &e, &ext, &ext, flags); - return ip_set_enomatch(ret, flags, adt) ? 1 : + return ip_set_enomatch(ret, flags, adt, set) ? -ret : ip_set_eexist(ret, flags) ? 0 : ret; } -- cgit v1.2.3-58-ga151 From 169faa2e19478b02027df04582ec7543dba1dd16 Mon Sep 17 00:00:00 2001 From: Jozsef Kadlecsik Date: Mon, 16 Sep 2013 20:07:35 +0200 Subject: netfilter: ipset: Validate the set family and not the set type family at swapping This closes netfilter bugzilla #843, reported by Quentin Armitage. Signed-off-by: Jozsef Kadlecsik --- net/netfilter/ipset/ip_set_core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'net') diff --git a/net/netfilter/ipset/ip_set_core.c b/net/netfilter/ipset/ip_set_core.c index c8c303c3386f..f2e30fb31e78 100644 --- a/net/netfilter/ipset/ip_set_core.c +++ b/net/netfilter/ipset/ip_set_core.c @@ -1052,7 +1052,7 @@ ip_set_swap(struct sock *ctnl, struct sk_buff *skb, * Not an artificial restriction anymore, as we must prevent * possible loops created by swapping in setlist type of sets. */ if (!(from->type->features == to->type->features && - from->type->family == to->type->family)) + from->family == to->family)) return -IPSET_ERR_TYPE_MISMATCH; strncpy(from_name, from->name, IPSET_MAXNAMELEN); -- cgit v1.2.3-58-ga151 From 2cf55125c64d64cc106e204d53b107094762dfdf Mon Sep 17 00:00:00 2001 From: Oliver Smith Date: Mon, 16 Sep 2013 20:30:57 +0200 Subject: netfilter: ipset: Fix serious failure in CIDR tracking This fixes a serious bug affecting all hash types with a net element - specifically, if a CIDR value is deleted such that none of the same size exist any more, all larger (less-specific) values will then fail to match. Adding back any prefix with a CIDR equal to or more specific than the one deleted will fix it. Steps to reproduce: ipset -N test hash:net ipset -A test 1.1.0.0/16 ipset -A test 2.2.2.0/24 ipset -T test 1.1.1.1 #1.1.1.1 IS in set ipset -D test 2.2.2.0/24 ipset -T test 1.1.1.1 #1.1.1.1 IS NOT in set This is due to the fact that the nets counter was unconditionally decremented prior to the iteration that shifts up the entries. Now, we first check if there is a proceeding entry and if not, decrement it and return. Otherwise, we proceed to iterate and then zero the last element, which, in most cases, will already be zero. Signed-off-by: Oliver Smith Signed-off-by: Jozsef Kadlecsik --- net/netfilter/ipset/ip_set_hash_gen.h | 28 ++++++++++++++++------------ 1 file changed, 16 insertions(+), 12 deletions(-) (limited to 'net') diff --git a/net/netfilter/ipset/ip_set_hash_gen.h b/net/netfilter/ipset/ip_set_hash_gen.h index 57beb1762b2d..707bc520d629 100644 --- a/net/netfilter/ipset/ip_set_hash_gen.h +++ b/net/netfilter/ipset/ip_set_hash_gen.h @@ -325,18 +325,22 @@ mtype_add_cidr(struct htype *h, u8 cidr, u8 nets_length) static void mtype_del_cidr(struct htype *h, u8 cidr, u8 nets_length) { - u8 i, j; - - for (i = 0; i < nets_length - 1 && h->nets[i].cidr != cidr; i++) - ; - h->nets[i].nets--; - - if (h->nets[i].nets != 0) - return; - - for (j = i; j < nets_length - 1 && h->nets[j].nets; j++) { - h->nets[j].cidr = h->nets[j + 1].cidr; - h->nets[j].nets = h->nets[j + 1].nets; + u8 i, j, net_end = nets_length - 1; + + for (i = 0; i < nets_length; i++) { + if (h->nets[i].cidr != cidr) + continue; + if (h->nets[i].nets > 1 || i == net_end || + h->nets[i + 1].nets == 0) { + h->nets[i].nets--; + return; + } + for (j = i; j < net_end && h->nets[j].nets; j++) { + h->nets[j].cidr = h->nets[j + 1].cidr; + h->nets[j].nets = h->nets[j + 1].nets; + } + h->nets[j].nets = 0; + return; } } #endif -- cgit v1.2.3-58-ga151 From 0d2ede929f61783aebfb9228e4d32a0546ee4d23 Mon Sep 17 00:00:00 2001 From: Ding Zhi Date: Mon, 16 Sep 2013 11:31:15 +0200 Subject: ip6_tunnels: raddr and laddr are inverted in nl msg IFLA_IPTUN_LOCAL and IFLA_IPTUN_REMOTE were inverted. Introduced by c075b13098b3 (ip6tnl: advertise tunnel param via rtnl). Signed-off-by: Ding Zhi Signed-off-by: Nicolas Dichtel Signed-off-by: David S. Miller --- net/ipv6/ip6_tunnel.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'net') diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c index 61355f7f4da5..2d8f4829575b 100644 --- a/net/ipv6/ip6_tunnel.c +++ b/net/ipv6/ip6_tunnel.c @@ -1656,9 +1656,9 @@ static int ip6_tnl_fill_info(struct sk_buff *skb, const struct net_device *dev) if (nla_put_u32(skb, IFLA_IPTUN_LINK, parm->link) || nla_put(skb, IFLA_IPTUN_LOCAL, sizeof(struct in6_addr), - &parm->raddr) || - nla_put(skb, IFLA_IPTUN_REMOTE, sizeof(struct in6_addr), &parm->laddr) || + nla_put(skb, IFLA_IPTUN_REMOTE, sizeof(struct in6_addr), + &parm->raddr) || nla_put_u8(skb, IFLA_IPTUN_TTL, parm->hop_limit) || nla_put_u8(skb, IFLA_IPTUN_ENCAP_LIMIT, parm->encap_limit) || nla_put_be32(skb, IFLA_IPTUN_FLOWINFO, parm->flowinfo) || -- cgit v1.2.3-58-ga151 From 3f96a532113131d5a65ac9e00fc83cfa31b0295f Mon Sep 17 00:00:00 2001 From: Daniel Borkmann Date: Mon, 16 Sep 2013 12:36:02 +0200 Subject: net: sctp: rfc4443: do not report ICMP redirects to user space Adapt the same behaviour for SCTP as present in TCP for ICMP redirect messages. For IPv6, RFC4443, section 2.4. says: ... (e) An ICMPv6 error message MUST NOT be originated as a result of receiving the following: ... (e.2) An ICMPv6 redirect message [IPv6-DISC]. ... Therefore, do not report an error to user space, just invoke dst's redirect callback and leave, same for IPv4 as done in TCP as well. The implication w/o having this patch could be that the reception of such packets would generate a poll notification and in worst case it could even tear down the whole connection. Therefore, stop updating sk_err on redirects. Reported-by: Duan Jiong Reported-by: Hannes Frederic Sowa Suggested-by: Vlad Yasevich Signed-off-by: Daniel Borkmann Signed-off-by: David S. Miller --- net/sctp/input.c | 3 +-- net/sctp/ipv6.c | 2 +- 2 files changed, 2 insertions(+), 3 deletions(-) (limited to 'net') diff --git a/net/sctp/input.c b/net/sctp/input.c index 5f2068679f83..98b69bbecdd9 100644 --- a/net/sctp/input.c +++ b/net/sctp/input.c @@ -634,8 +634,7 @@ void sctp_v4_err(struct sk_buff *skb, __u32 info) break; case ICMP_REDIRECT: sctp_icmp_redirect(sk, transport, skb); - err = 0; - break; + /* Fall through to out_unlock. */ default: goto out_unlock; } diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c index 4f52e2ce263d..e7b2d4fe2b6a 100644 --- a/net/sctp/ipv6.c +++ b/net/sctp/ipv6.c @@ -183,7 +183,7 @@ static void sctp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt, break; case NDISC_REDIRECT: sctp_icmp_redirect(sk, transport, skb); - break; + goto out_unlock; default: break; } -- cgit v1.2.3-58-ga151 From 0a0d80eb39aa465b7bdf6f7754d0ba687eb3d2a7 Mon Sep 17 00:00:00 2001 From: Gao feng Date: Tue, 17 Sep 2013 13:03:47 +0200 Subject: netfilter: nfnetlink_queue: use network skb for sequence adjustment Instead of the netlink skb. Signed-off-by: Gao feng Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nfnetlink_queue_core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'net') diff --git a/net/netfilter/nfnetlink_queue_core.c b/net/netfilter/nfnetlink_queue_core.c index 95a98c8c1da6..ae2e5c11d01a 100644 --- a/net/netfilter/nfnetlink_queue_core.c +++ b/net/netfilter/nfnetlink_queue_core.c @@ -1009,7 +1009,7 @@ nfqnl_recv_verdict(struct sock *ctnl, struct sk_buff *skb, verdict = NF_DROP; if (ct) - nfqnl_ct_seq_adjust(skb, ct, ctinfo, diff); + nfqnl_ct_seq_adjust(entry->skb, ct, ctinfo, diff); } if (nfqa[NFQA_MARK]) -- cgit v1.2.3-58-ga151 From 4c18c425b2d228415b635e97a64737d7f27c5536 Mon Sep 17 00:00:00 2001 From: Antonio Quartulli Date: Wed, 11 Sep 2013 19:14:44 +0200 Subject: batman-adv: set the TAG flag for the vid passed to BLA When receiving or sending a packet a packet on a VLAN, the vid has to be marked with the TAG flag in order to make any component in batman-adv understand that the packet is coming from a really tagged network. This fix the Bridge Loop Avoidance behaviour which was not able to send announces over VLAN interfaces. Introduced by 0b1da1765fdb00ca5d53bc95c9abc70dfc9aae5b ("batman-adv: change VID semantic in the BLA code") Signed-off-by: Antonio Quartulli Acked-by: Simon Wunderlich Signed-off-by: Marek Lindner --- net/batman-adv/soft-interface.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'net') diff --git a/net/batman-adv/soft-interface.c b/net/batman-adv/soft-interface.c index 4493913f0d5c..813db4e64602 100644 --- a/net/batman-adv/soft-interface.c +++ b/net/batman-adv/soft-interface.c @@ -168,6 +168,7 @@ static int batadv_interface_tx(struct sk_buff *skb, case ETH_P_8021Q: vhdr = (struct vlan_ethhdr *)skb->data; vid = ntohs(vhdr->h_vlan_TCI) & VLAN_VID_MASK; + vid |= BATADV_VLAN_HAS_TAG; if (vhdr->h_vlan_encapsulated_proto != ethertype) break; @@ -331,6 +332,7 @@ void batadv_interface_rx(struct net_device *soft_iface, case ETH_P_8021Q: vhdr = (struct vlan_ethhdr *)skb->data; vid = ntohs(vhdr->h_vlan_TCI) & VLAN_VID_MASK; + vid |= BATADV_VLAN_HAS_TAG; if (vhdr->h_vlan_encapsulated_proto != ethertype) break; -- cgit v1.2.3-58-ga151 From 269aa759b474570fa642452742741525cfc226a9 Mon Sep 17 00:00:00 2001 From: Neal Cardwell Date: Mon, 16 Sep 2013 21:44:20 -0400 Subject: tcp: fix RTO calculated from cached RTT Commit 1b7fdd2ab5852 ("tcp: do not use cached RTT for RTT estimation") did not correctly account for the fact that crtt is the RTT shifted left 3 bits. Fix the calculation to consistently reflect this fact. Signed-off-by: Neal Cardwell Cc: Eric Dumazet Cc: Yuchung Cheng Acked-by: Eric Dumazet Acked-By: Yuchung Cheng Signed-off-by: David S. Miller --- net/ipv4/tcp_metrics.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) (limited to 'net') diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c index 4a22f3e715df..52f3c6b971d2 100644 --- a/net/ipv4/tcp_metrics.c +++ b/net/ipv4/tcp_metrics.c @@ -502,7 +502,9 @@ reset: * ACKs, wait for troubles. */ if (crtt > tp->srtt) { - inet_csk(sk)->icsk_rto = crtt + max(crtt >> 2, tcp_rto_min(sk)); + /* Set RTO like tcp_rtt_estimator(), but from cached RTT. */ + crtt >>= 3; + inet_csk(sk)->icsk_rto = crtt + max(2 * crtt, tcp_rto_min(sk)); } else if (tp->srtt == 0) { /* RFC6298: 5.7 We've failed to get a valid RTT sample from * 3WHS. This is most likely due to retransmission, -- cgit v1.2.3-58-ga151 From a0f6ed8ebe4f6d494ef70f67d4c0c153cbf59577 Mon Sep 17 00:00:00 2001 From: "J. Bruce Fields" Date: Wed, 18 Sep 2013 11:16:03 -0400 Subject: RPCSEC_GSS: fix crash on destroying gss auth This fixes a regression since eb6dc19d8e72ce3a957af5511d20c0db0a8bd007 "RPCSEC_GSS: Share all credential caches on a per-transport basis" which could cause an occasional oops in the nfsd code (see below). The problem was that an auth was left referencing a client that had been freed. To avoid this we need to ensure that auths are shared only between descendants of a common client; the fact that a clone of an rpc_client takes a reference on its parent then ensures that the parent client will last as long as the auth. Also add a comment explaining what I think was the intention of this code. general protection fault: 0000 [#1] PREEMPT SMP Modules linked in: rpcsec_gss_krb5 nfsd auth_rpcgss oid_registry nfs_acl lockd sunrpc CPU: 3 PID: 4071 Comm: kworker/u8:2 Not tainted 3.11.0-rc2-00182-g025145f #1665 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Workqueue: nfsd4_callbacks nfsd4_do_callback_rpc [nfsd] task: ffff88003e206080 ti: ffff88003c384000 task.ti: ffff88003c384000 RIP: 0010:[] [] rpc_net_ns+0x53/0x70 [sunrpc] RSP: 0000:ffff88003c385ab8 EFLAGS: 00010246 RAX: 6b6b6b6b6b6b6b6b RBX: ffff88003af9a800 RCX: 0000000000000002 RDX: ffffffffa00001a5 RSI: 0000000000000001 RDI: ffffffff81e284e0 RBP: ffff88003c385ad8 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000015 R12: ffff88003c990840 R13: ffff88003c990878 R14: ffff88003c385ba8 R15: ffff88003e206080 FS: 0000000000000000(0000) GS:ffff88003fd80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007fcdf737e000 CR3: 000000003ad2b000 CR4: 00000000000006e0 Stack: ffffffffa00001a5 0000000000000006 0000000000000006 ffff88003af9a800 ffff88003c385b08 ffffffffa00d52a4 ffff88003c385ba8 ffff88003c751bd8 ffff88003c751bc0 ffff88003e113600 ffff88003c385b18 ffffffffa00d530c Call Trace: [] ? rpc_net_ns+0x5/0x70 [sunrpc] [] __gss_pipe_release+0x54/0x90 [auth_rpcgss] [] gss_pipe_free+0x2c/0x30 [auth_rpcgss] [] gss_destroy+0x9b/0xf0 [auth_rpcgss] [] rpcauth_release+0x23/0x30 [sunrpc] [] rpc_release_client+0x51/0xb0 [sunrpc] [] rpc_shutdown_client+0xe5/0x170 [sunrpc] [] ? cpuacct_charge+0xa4/0xb0 [] ? cpuacct_charge+0x5/0xb0 [] nfsd4_process_cb_update.isra.17+0x2f/0x210 [nfsd] [] ? _raw_spin_unlock_irq+0x30/0x60 [] ? _raw_spin_unlock_irq+0x3b/0x60 [] ? process_one_work+0x15b/0x510 [] nfsd4_do_callback_rpc+0x8d/0xa0 [nfsd] [] process_one_work+0x1ce/0x510 [] ? process_one_work+0x15b/0x510 [] worker_thread+0x11b/0x370 [] ? manage_workers.isra.24+0x2b0/0x2b0 [] kthread+0xdb/0xe0 [] ? _raw_spin_unlock_irq+0x30/0x60 [] ? __init_kthread_worker+0x70/0x70 [] ret_from_fork+0x7c/0xb0 [] ? __init_kthread_worker+0x70/0x70 Code: a5 01 00 a0 31 d2 31 f6 48 c7 c7 e0 84 e2 81 e8 f4 91 0a e1 48 8b 43 60 48 c7 c2 a5 01 00 a0 be 01 00 00 00 48 c7 c7 e0 84 e2 81 <48> 8b 98 10 07 00 00 e8 91 8f 0a e1 e8 +3c 4e 07 e1 48 83 c4 18 RIP [] rpc_net_ns+0x53/0x70 [sunrpc] RSP Signed-off-by: J. Bruce Fields Signed-off-by: Trond Myklebust --- net/sunrpc/auth_gss/auth_gss.c | 11 +++++++++++ 1 file changed, 11 insertions(+) (limited to 'net') diff --git a/net/sunrpc/auth_gss/auth_gss.c b/net/sunrpc/auth_gss/auth_gss.c index fcac5d141717..084656671d6e 100644 --- a/net/sunrpc/auth_gss/auth_gss.c +++ b/net/sunrpc/auth_gss/auth_gss.c @@ -1075,6 +1075,15 @@ gss_destroy(struct rpc_auth *auth) kref_put(&gss_auth->kref, gss_free_callback); } +/* + * Auths may be shared between rpc clients that were cloned from a + * common client with the same xprt, if they also share the flavor and + * target_name. + * + * The auth is looked up from the oldest parent sharing the same + * cl_xprt, and the auth itself references only that common parent + * (which is guaranteed to last as long as any of its descendants). + */ static struct gss_auth * gss_auth_find_or_add_hashed(struct rpc_auth_create_args *args, struct rpc_clnt *clnt, @@ -1088,6 +1097,8 @@ gss_auth_find_or_add_hashed(struct rpc_auth_create_args *args, gss_auth, hash, hashval) { + if (gss_auth->client != clnt) + continue; if (gss_auth->rpc_auth.au_flavor != args->pseudoflavor) continue; if (gss_auth->target_name != args->target_name) { -- cgit v1.2.3-58-ga151 From bd784a140712fd06674f2240eecfc4ccae421129 Mon Sep 17 00:00:00 2001 From: Duan Jiong Date: Wed, 18 Sep 2013 20:03:27 +0800 Subject: net:dccp: do not report ICMP redirects to user space DCCP shouldn't be setting sk_err on redirects as it isn't an error condition. it should be doing exactly what tcp is doing and leaving the error handler without touching the socket. Signed-off-by: Duan Jiong Signed-off-by: David S. Miller --- net/dccp/ipv6.c | 1 + 1 file changed, 1 insertion(+) (limited to 'net') diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c index 9c61f9c02fdb..6cf9f7782ad4 100644 --- a/net/dccp/ipv6.c +++ b/net/dccp/ipv6.c @@ -135,6 +135,7 @@ static void dccp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt, if (dst) dst->ops->redirect(dst, sk, skb); + goto out; } if (type == ICMPV6_PKT_TOOBIG) { -- cgit v1.2.3-58-ga151 From 5e130367d43ff22836bbae380d197d600fe8ddbb Mon Sep 17 00:00:00 2001 From: Johan Hedberg Date: Fri, 13 Sep 2013 08:58:17 +0300 Subject: Bluetooth: Introduce a new HCI_RFKILLED flag This makes it more convenient to check for rfkill (no need to check for dev->rfkill before calling rfkill_blocked()) and also avoids potential races if the RFKILL state needs to be checked from within the rfkill callback. Signed-off-by: Johan Hedberg Cc: stable@vger.kernel.org Acked-by: Marcel Holtmann Signed-off-by: Gustavo Padovan --- include/net/bluetooth/hci.h | 1 + net/bluetooth/hci_core.c | 15 ++++++++++----- 2 files changed, 11 insertions(+), 5 deletions(-) (limited to 'net') diff --git a/include/net/bluetooth/hci.h b/include/net/bluetooth/hci.h index aaeaf0938ec0..15f10841e2b5 100644 --- a/include/net/bluetooth/hci.h +++ b/include/net/bluetooth/hci.h @@ -104,6 +104,7 @@ enum { enum { HCI_SETUP, HCI_AUTO_OFF, + HCI_RFKILLED, HCI_MGMT, HCI_PAIRABLE, HCI_SERVICE_CACHE, diff --git a/net/bluetooth/hci_core.c b/net/bluetooth/hci_core.c index 634debab4d54..0d5bc246b607 100644 --- a/net/bluetooth/hci_core.c +++ b/net/bluetooth/hci_core.c @@ -1146,7 +1146,7 @@ int hci_dev_open(__u16 dev) goto done; } - if (hdev->rfkill && rfkill_blocked(hdev->rfkill)) { + if (test_bit(HCI_RFKILLED, &hdev->dev_flags)) { ret = -ERFKILL; goto done; } @@ -1566,10 +1566,12 @@ static int hci_rfkill_set_block(void *data, bool blocked) BT_DBG("%p name %s blocked %d", hdev, hdev->name, blocked); - if (!blocked) - return 0; - - hci_dev_do_close(hdev); + if (blocked) { + set_bit(HCI_RFKILLED, &hdev->dev_flags); + hci_dev_do_close(hdev); + } else { + clear_bit(HCI_RFKILLED, &hdev->dev_flags); +} return 0; } @@ -2209,6 +2211,9 @@ int hci_register_dev(struct hci_dev *hdev) } } + if (hdev->rfkill && rfkill_blocked(hdev->rfkill)) + set_bit(HCI_RFKILLED, &hdev->dev_flags); + set_bit(HCI_SETUP, &hdev->dev_flags); if (hdev->dev_type != HCI_AMP) -- cgit v1.2.3-58-ga151 From bf5430360ebe4b2d0c51d91f782e649107b502eb Mon Sep 17 00:00:00 2001 From: Johan Hedberg Date: Fri, 13 Sep 2013 08:58:18 +0300 Subject: Bluetooth: Fix rfkill functionality during the HCI setup stage We need to let the setup stage complete cleanly even when the HCI device is rfkilled. Otherwise the HCI device will stay in an undefined state and never get notified to user space through mgmt (even when it gets unblocked through rfkill). This patch makes sure that hci_dev_open() can be called in the HCI_SETUP stage, that blocking the device doesn't abort the setup stage, and that the device gets proper powered down as soon as the setup stage completes in case it was blocked meanwhile. The bug that this patch fixed can be very easily reproduced using e.g. the rfkill command line too. By running "rfkill block all" before inserting a Bluetooth dongle the resulting HCI device goes into a state where it is never announced over mgmt, not even when "rfkill unblock all" is run. Signed-off-by: Johan Hedberg Cc: stable@vger.kernel.org Acked-by: Marcel Holtmann Signed-off-by: Gustavo Padovan --- net/bluetooth/hci_core.c | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-) (limited to 'net') diff --git a/net/bluetooth/hci_core.c b/net/bluetooth/hci_core.c index 0d5bc246b607..1b66547a3ca6 100644 --- a/net/bluetooth/hci_core.c +++ b/net/bluetooth/hci_core.c @@ -1146,7 +1146,11 @@ int hci_dev_open(__u16 dev) goto done; } - if (test_bit(HCI_RFKILLED, &hdev->dev_flags)) { + /* Check for rfkill but allow the HCI setup stage to proceed + * (which in itself doesn't cause any RF activity). + */ + if (test_bit(HCI_RFKILLED, &hdev->dev_flags) && + !test_bit(HCI_SETUP, &hdev->dev_flags)) { ret = -ERFKILL; goto done; } @@ -1568,7 +1572,8 @@ static int hci_rfkill_set_block(void *data, bool blocked) if (blocked) { set_bit(HCI_RFKILLED, &hdev->dev_flags); - hci_dev_do_close(hdev); + if (!test_bit(HCI_SETUP, &hdev->dev_flags)) + hci_dev_do_close(hdev); } else { clear_bit(HCI_RFKILLED, &hdev->dev_flags); } @@ -1593,9 +1598,13 @@ static void hci_power_on(struct work_struct *work) return; } - if (test_bit(HCI_AUTO_OFF, &hdev->dev_flags)) + if (test_bit(HCI_RFKILLED, &hdev->dev_flags)) { + clear_bit(HCI_AUTO_OFF, &hdev->dev_flags); + hci_dev_do_close(hdev); + } else if (test_bit(HCI_AUTO_OFF, &hdev->dev_flags)) { queue_delayed_work(hdev->req_workqueue, &hdev->power_off, HCI_AUTO_OFF_TIMEOUT); + } if (test_and_clear_bit(HCI_SETUP, &hdev->dev_flags)) mgmt_index_added(hdev); -- cgit v1.2.3-58-ga151 From c16526a7b99c1c28e9670a8c8e3dbcf741bb32be Mon Sep 17 00:00:00 2001 From: Simon Kirby Date: Sat, 10 Aug 2013 01:26:18 -0700 Subject: ipvs: fix overflow on dest weight multiply Schedulers such as lblc and lblcr require the weight to be as high as the maximum number of active connections. In commit b552f7e3a9524abcbcdf ("ipvs: unify the formula to estimate the overhead of processing connections"), the consideration of inactconns and activeconns was cleaned up to always count activeconns as 256 times more important than inactconns. In cases where 3000 or more connections are expected, a weight of 3000 * 256 * 3000 connections overflows the 32-bit signed result used to determine if rescheduling is required. On amd64, this merely changes the multiply and comparison instructions to 64-bit. On x86, a 64-bit result is already present from imull, so only a few more comparison instructions are emitted. Signed-off-by: Simon Kirby Acked-by: Julian Anastasov Signed-off-by: Simon Horman --- include/net/ip_vs.h | 2 +- net/netfilter/ipvs/ip_vs_lblc.c | 4 ++-- net/netfilter/ipvs/ip_vs_lblcr.c | 12 ++++++------ net/netfilter/ipvs/ip_vs_nq.c | 8 ++++---- net/netfilter/ipvs/ip_vs_sed.c | 8 ++++---- net/netfilter/ipvs/ip_vs_wlc.c | 6 +++--- 6 files changed, 20 insertions(+), 20 deletions(-) (limited to 'net') diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h index f0d70f066f3d..fe782ed2fe72 100644 --- a/include/net/ip_vs.h +++ b/include/net/ip_vs.h @@ -1649,7 +1649,7 @@ static inline void ip_vs_conn_drop_conntrack(struct ip_vs_conn *cp) /* CONFIG_IP_VS_NFCT */ #endif -static inline unsigned int +static inline int ip_vs_dest_conn_overhead(struct ip_vs_dest *dest) { /* diff --git a/net/netfilter/ipvs/ip_vs_lblc.c b/net/netfilter/ipvs/ip_vs_lblc.c index 1383b0eadc0e..eb814bf74e64 100644 --- a/net/netfilter/ipvs/ip_vs_lblc.c +++ b/net/netfilter/ipvs/ip_vs_lblc.c @@ -443,8 +443,8 @@ __ip_vs_lblc_schedule(struct ip_vs_service *svc) continue; doh = ip_vs_dest_conn_overhead(dest); - if (loh * atomic_read(&dest->weight) > - doh * atomic_read(&least->weight)) { + if ((__s64)loh * atomic_read(&dest->weight) > + (__s64)doh * atomic_read(&least->weight)) { least = dest; loh = doh; } diff --git a/net/netfilter/ipvs/ip_vs_lblcr.c b/net/netfilter/ipvs/ip_vs_lblcr.c index 5199448697f6..e65f7c573090 100644 --- a/net/netfilter/ipvs/ip_vs_lblcr.c +++ b/net/netfilter/ipvs/ip_vs_lblcr.c @@ -200,8 +200,8 @@ static inline struct ip_vs_dest *ip_vs_dest_set_min(struct ip_vs_dest_set *set) continue; doh = ip_vs_dest_conn_overhead(dest); - if ((loh * atomic_read(&dest->weight) > - doh * atomic_read(&least->weight)) + if (((__s64)loh * atomic_read(&dest->weight) > + (__s64)doh * atomic_read(&least->weight)) && (dest->flags & IP_VS_DEST_F_AVAILABLE)) { least = dest; loh = doh; @@ -246,8 +246,8 @@ static inline struct ip_vs_dest *ip_vs_dest_set_max(struct ip_vs_dest_set *set) dest = rcu_dereference_protected(e->dest, 1); doh = ip_vs_dest_conn_overhead(dest); /* moh/mw < doh/dw ==> moh*dw < doh*mw, where mw,dw>0 */ - if ((moh * atomic_read(&dest->weight) < - doh * atomic_read(&most->weight)) + if (((__s64)moh * atomic_read(&dest->weight) < + (__s64)doh * atomic_read(&most->weight)) && (atomic_read(&dest->weight) > 0)) { most = dest; moh = doh; @@ -611,8 +611,8 @@ __ip_vs_lblcr_schedule(struct ip_vs_service *svc) continue; doh = ip_vs_dest_conn_overhead(dest); - if (loh * atomic_read(&dest->weight) > - doh * atomic_read(&least->weight)) { + if ((__s64)loh * atomic_read(&dest->weight) > + (__s64)doh * atomic_read(&least->weight)) { least = dest; loh = doh; } diff --git a/net/netfilter/ipvs/ip_vs_nq.c b/net/netfilter/ipvs/ip_vs_nq.c index d8d9860934fe..961a6de9bb29 100644 --- a/net/netfilter/ipvs/ip_vs_nq.c +++ b/net/netfilter/ipvs/ip_vs_nq.c @@ -40,7 +40,7 @@ #include -static inline unsigned int +static inline int ip_vs_nq_dest_overhead(struct ip_vs_dest *dest) { /* @@ -59,7 +59,7 @@ ip_vs_nq_schedule(struct ip_vs_service *svc, const struct sk_buff *skb, struct ip_vs_iphdr *iph) { struct ip_vs_dest *dest, *least = NULL; - unsigned int loh = 0, doh; + int loh = 0, doh; IP_VS_DBG(6, "%s(): Scheduling...\n", __func__); @@ -92,8 +92,8 @@ ip_vs_nq_schedule(struct ip_vs_service *svc, const struct sk_buff *skb, } if (!least || - (loh * atomic_read(&dest->weight) > - doh * atomic_read(&least->weight))) { + ((__s64)loh * atomic_read(&dest->weight) > + (__s64)doh * atomic_read(&least->weight))) { least = dest; loh = doh; } diff --git a/net/netfilter/ipvs/ip_vs_sed.c b/net/netfilter/ipvs/ip_vs_sed.c index a5284cc3d882..e446b9fa7424 100644 --- a/net/netfilter/ipvs/ip_vs_sed.c +++ b/net/netfilter/ipvs/ip_vs_sed.c @@ -44,7 +44,7 @@ #include -static inline unsigned int +static inline int ip_vs_sed_dest_overhead(struct ip_vs_dest *dest) { /* @@ -63,7 +63,7 @@ ip_vs_sed_schedule(struct ip_vs_service *svc, const struct sk_buff *skb, struct ip_vs_iphdr *iph) { struct ip_vs_dest *dest, *least; - unsigned int loh, doh; + int loh, doh; IP_VS_DBG(6, "%s(): Scheduling...\n", __func__); @@ -99,8 +99,8 @@ ip_vs_sed_schedule(struct ip_vs_service *svc, const struct sk_buff *skb, if (dest->flags & IP_VS_DEST_F_OVERLOAD) continue; doh = ip_vs_sed_dest_overhead(dest); - if (loh * atomic_read(&dest->weight) > - doh * atomic_read(&least->weight)) { + if ((__s64)loh * atomic_read(&dest->weight) > + (__s64)doh * atomic_read(&least->weight)) { least = dest; loh = doh; } diff --git a/net/netfilter/ipvs/ip_vs_wlc.c b/net/netfilter/ipvs/ip_vs_wlc.c index 6dc1fa128840..b5b4650d50a9 100644 --- a/net/netfilter/ipvs/ip_vs_wlc.c +++ b/net/netfilter/ipvs/ip_vs_wlc.c @@ -35,7 +35,7 @@ ip_vs_wlc_schedule(struct ip_vs_service *svc, const struct sk_buff *skb, struct ip_vs_iphdr *iph) { struct ip_vs_dest *dest, *least; - unsigned int loh, doh; + int loh, doh; IP_VS_DBG(6, "ip_vs_wlc_schedule(): Scheduling...\n"); @@ -71,8 +71,8 @@ ip_vs_wlc_schedule(struct ip_vs_service *svc, const struct sk_buff *skb, if (dest->flags & IP_VS_DEST_F_OVERLOAD) continue; doh = ip_vs_dest_conn_overhead(dest); - if (loh * atomic_read(&dest->weight) > - doh * atomic_read(&least->weight)) { + if ((__s64)loh * atomic_read(&dest->weight) > + (__s64)doh * atomic_read(&least->weight)) { least = dest; loh = doh; } -- cgit v1.2.3-58-ga151 From bcbde4c0a7556cca72874c5e1efa4dccb5198a2b Mon Sep 17 00:00:00 2001 From: Julian Anastasov Date: Thu, 12 Sep 2013 11:21:07 +0300 Subject: ipvs: make the service replacement more robust commit 578bc3ef1e473a ("ipvs: reorganize dest trash") added IP_VS_DEST_STATE_REMOVING flag and RCU callback named ip_vs_dest_wait_readers() to keep dests and services after removal for at least a RCU grace period. But we have the following corner cases: - we can not reuse the same dest if its service is removed while IP_VS_DEST_STATE_REMOVING is still set because another dest removal in the first grace period can not extend this period. It can happen when ipvsadm -C && ipvsadm -R is used. - dest->svc can be replaced but ip_vs_in_stats() and ip_vs_out_stats() have no explicit read memory barriers when accessing dest->svc. It can happen that dest->svc was just freed (replaced) while we use it to update the stats. We solve the problems as follows: - IP_VS_DEST_STATE_REMOVING is removed and we ensure a fixed idle period for the dest (IP_VS_DEST_TRASH_PERIOD). idle_start will remember when for first time after deletion we noticed dest->refcnt=0. Later, the connections can grab a reference while in RCU grace period but if refcnt becomes 0 we can safely free the dest and its svc. - dest->svc becomes RCU pointer. As result, we add explicit RCU locking in ip_vs_in_stats() and ip_vs_out_stats(). - __ip_vs_unbind_svc is renamed to __ip_vs_svc_put(), it now can free the service immediately or after a RCU grace period. dest->svc is not set to NULL anymore. As result, unlinked dests and their services are freed always after IP_VS_DEST_TRASH_PERIOD period, unused services are freed after a RCU grace period. Signed-off-by: Julian Anastasov Signed-off-by: Simon Horman --- include/net/ip_vs.h | 7 +--- net/netfilter/ipvs/ip_vs_core.c | 12 +++++- net/netfilter/ipvs/ip_vs_ctl.c | 86 +++++++++++++++++------------------------ 3 files changed, 47 insertions(+), 58 deletions(-) (limited to 'net') diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h index fe782ed2fe72..9c4d37ec45a1 100644 --- a/include/net/ip_vs.h +++ b/include/net/ip_vs.h @@ -723,8 +723,6 @@ struct ip_vs_dest_dst { struct rcu_head rcu_head; }; -/* In grace period after removing */ -#define IP_VS_DEST_STATE_REMOVING 0x01 /* * The real server destination forwarding entry * with ip address, port number, and so on. @@ -742,7 +740,7 @@ struct ip_vs_dest { atomic_t refcnt; /* reference counter */ struct ip_vs_stats stats; /* statistics */ - unsigned long state; /* state flags */ + unsigned long idle_start; /* start time, jiffies */ /* connection counters and thresholds */ atomic_t activeconns; /* active connections */ @@ -756,14 +754,13 @@ struct ip_vs_dest { struct ip_vs_dest_dst __rcu *dest_dst; /* cached dst info */ /* for virtual service */ - struct ip_vs_service *svc; /* service it belongs to */ + struct ip_vs_service __rcu *svc; /* service it belongs to */ __u16 protocol; /* which protocol (TCP/UDP) */ __be16 vport; /* virtual port number */ union nf_inet_addr vaddr; /* virtual IP address */ __u32 vfwmark; /* firewall mark of service */ struct list_head t_list; /* in dest_trash */ - struct rcu_head rcu_head; unsigned int in_rs_table:1; /* we are in rs_table */ }; diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c index 4f69e83ff836..74fd00c27210 100644 --- a/net/netfilter/ipvs/ip_vs_core.c +++ b/net/netfilter/ipvs/ip_vs_core.c @@ -116,6 +116,7 @@ ip_vs_in_stats(struct ip_vs_conn *cp, struct sk_buff *skb) if (dest && (dest->flags & IP_VS_DEST_F_AVAILABLE)) { struct ip_vs_cpu_stats *s; + struct ip_vs_service *svc; s = this_cpu_ptr(dest->stats.cpustats); s->ustats.inpkts++; @@ -123,11 +124,14 @@ ip_vs_in_stats(struct ip_vs_conn *cp, struct sk_buff *skb) s->ustats.inbytes += skb->len; u64_stats_update_end(&s->syncp); - s = this_cpu_ptr(dest->svc->stats.cpustats); + rcu_read_lock(); + svc = rcu_dereference(dest->svc); + s = this_cpu_ptr(svc->stats.cpustats); s->ustats.inpkts++; u64_stats_update_begin(&s->syncp); s->ustats.inbytes += skb->len; u64_stats_update_end(&s->syncp); + rcu_read_unlock(); s = this_cpu_ptr(ipvs->tot_stats.cpustats); s->ustats.inpkts++; @@ -146,6 +150,7 @@ ip_vs_out_stats(struct ip_vs_conn *cp, struct sk_buff *skb) if (dest && (dest->flags & IP_VS_DEST_F_AVAILABLE)) { struct ip_vs_cpu_stats *s; + struct ip_vs_service *svc; s = this_cpu_ptr(dest->stats.cpustats); s->ustats.outpkts++; @@ -153,11 +158,14 @@ ip_vs_out_stats(struct ip_vs_conn *cp, struct sk_buff *skb) s->ustats.outbytes += skb->len; u64_stats_update_end(&s->syncp); - s = this_cpu_ptr(dest->svc->stats.cpustats); + rcu_read_lock(); + svc = rcu_dereference(dest->svc); + s = this_cpu_ptr(svc->stats.cpustats); s->ustats.outpkts++; u64_stats_update_begin(&s->syncp); s->ustats.outbytes += skb->len; u64_stats_update_end(&s->syncp); + rcu_read_unlock(); s = this_cpu_ptr(ipvs->tot_stats.cpustats); s->ustats.outpkts++; diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c index c8148e487386..a3df9bddc4f7 100644 --- a/net/netfilter/ipvs/ip_vs_ctl.c +++ b/net/netfilter/ipvs/ip_vs_ctl.c @@ -460,7 +460,7 @@ static inline void __ip_vs_bind_svc(struct ip_vs_dest *dest, struct ip_vs_service *svc) { atomic_inc(&svc->refcnt); - dest->svc = svc; + rcu_assign_pointer(dest->svc, svc); } static void ip_vs_service_free(struct ip_vs_service *svc) @@ -470,18 +470,25 @@ static void ip_vs_service_free(struct ip_vs_service *svc) kfree(svc); } -static void -__ip_vs_unbind_svc(struct ip_vs_dest *dest) +static void ip_vs_service_rcu_free(struct rcu_head *head) { - struct ip_vs_service *svc = dest->svc; + struct ip_vs_service *svc; + + svc = container_of(head, struct ip_vs_service, rcu_head); + ip_vs_service_free(svc); +} - dest->svc = NULL; +static void __ip_vs_svc_put(struct ip_vs_service *svc, bool do_delay) +{ if (atomic_dec_and_test(&svc->refcnt)) { IP_VS_DBG_BUF(3, "Removing service %u/%s:%u\n", svc->fwmark, IP_VS_DBG_ADDR(svc->af, &svc->addr), ntohs(svc->port)); - ip_vs_service_free(svc); + if (do_delay) + call_rcu(&svc->rcu_head, ip_vs_service_rcu_free); + else + ip_vs_service_free(svc); } } @@ -667,11 +674,6 @@ ip_vs_trash_get_dest(struct ip_vs_service *svc, const union nf_inet_addr *daddr, IP_VS_DBG_ADDR(svc->af, &dest->addr), ntohs(dest->port), atomic_read(&dest->refcnt)); - /* We can not reuse dest while in grace period - * because conns still can use dest->svc - */ - if (test_bit(IP_VS_DEST_STATE_REMOVING, &dest->state)) - continue; if (dest->af == svc->af && ip_vs_addr_equal(svc->af, &dest->addr, daddr) && dest->port == dport && @@ -697,8 +699,10 @@ out: static void ip_vs_dest_free(struct ip_vs_dest *dest) { + struct ip_vs_service *svc = rcu_dereference_protected(dest->svc, 1); + __ip_vs_dst_cache_reset(dest); - __ip_vs_unbind_svc(dest); + __ip_vs_svc_put(svc, false); free_percpu(dest->stats.cpustats); kfree(dest); } @@ -771,6 +775,7 @@ __ip_vs_update_dest(struct ip_vs_service *svc, struct ip_vs_dest *dest, struct ip_vs_dest_user_kern *udest, int add) { struct netns_ipvs *ipvs = net_ipvs(svc->net); + struct ip_vs_service *old_svc; struct ip_vs_scheduler *sched; int conn_flags; @@ -792,13 +797,14 @@ __ip_vs_update_dest(struct ip_vs_service *svc, struct ip_vs_dest *dest, atomic_set(&dest->conn_flags, conn_flags); /* bind the service */ - if (!dest->svc) { + old_svc = rcu_dereference_protected(dest->svc, 1); + if (!old_svc) { __ip_vs_bind_svc(dest, svc); } else { - if (dest->svc != svc) { - __ip_vs_unbind_svc(dest); + if (old_svc != svc) { ip_vs_zero_stats(&dest->stats); __ip_vs_bind_svc(dest, svc); + __ip_vs_svc_put(old_svc, true); } } @@ -998,16 +1004,6 @@ ip_vs_edit_dest(struct ip_vs_service *svc, struct ip_vs_dest_user_kern *udest) return 0; } -static void ip_vs_dest_wait_readers(struct rcu_head *head) -{ - struct ip_vs_dest *dest = container_of(head, struct ip_vs_dest, - rcu_head); - - /* End of grace period after unlinking */ - clear_bit(IP_VS_DEST_STATE_REMOVING, &dest->state); -} - - /* * Delete a destination (must be already unlinked from the service) */ @@ -1023,20 +1019,16 @@ static void __ip_vs_del_dest(struct net *net, struct ip_vs_dest *dest, */ ip_vs_rs_unhash(dest); - if (!cleanup) { - set_bit(IP_VS_DEST_STATE_REMOVING, &dest->state); - call_rcu(&dest->rcu_head, ip_vs_dest_wait_readers); - } - spin_lock_bh(&ipvs->dest_trash_lock); IP_VS_DBG_BUF(3, "Moving dest %s:%u into trash, dest->refcnt=%d\n", IP_VS_DBG_ADDR(dest->af, &dest->addr), ntohs(dest->port), atomic_read(&dest->refcnt)); if (list_empty(&ipvs->dest_trash) && !cleanup) mod_timer(&ipvs->dest_trash_timer, - jiffies + IP_VS_DEST_TRASH_PERIOD); + jiffies + (IP_VS_DEST_TRASH_PERIOD >> 1)); /* dest lives in trash without reference */ list_add(&dest->t_list, &ipvs->dest_trash); + dest->idle_start = 0; spin_unlock_bh(&ipvs->dest_trash_lock); ip_vs_dest_put(dest); } @@ -1108,24 +1100,30 @@ static void ip_vs_dest_trash_expire(unsigned long data) struct net *net = (struct net *) data; struct netns_ipvs *ipvs = net_ipvs(net); struct ip_vs_dest *dest, *next; + unsigned long now = jiffies; spin_lock(&ipvs->dest_trash_lock); list_for_each_entry_safe(dest, next, &ipvs->dest_trash, t_list) { - /* Skip if dest is in grace period */ - if (test_bit(IP_VS_DEST_STATE_REMOVING, &dest->state)) - continue; if (atomic_read(&dest->refcnt) > 0) continue; + if (dest->idle_start) { + if (time_before(now, dest->idle_start + + IP_VS_DEST_TRASH_PERIOD)) + continue; + } else { + dest->idle_start = max(1UL, now); + continue; + } IP_VS_DBG_BUF(3, "Removing destination %u/%s:%u from trash\n", dest->vfwmark, - IP_VS_DBG_ADDR(dest->svc->af, &dest->addr), + IP_VS_DBG_ADDR(dest->af, &dest->addr), ntohs(dest->port)); list_del(&dest->t_list); ip_vs_dest_free(dest); } if (!list_empty(&ipvs->dest_trash)) mod_timer(&ipvs->dest_trash_timer, - jiffies + IP_VS_DEST_TRASH_PERIOD); + jiffies + (IP_VS_DEST_TRASH_PERIOD >> 1)); spin_unlock(&ipvs->dest_trash_lock); } @@ -1320,14 +1318,6 @@ out: return ret; } -static void ip_vs_service_rcu_free(struct rcu_head *head) -{ - struct ip_vs_service *svc; - - svc = container_of(head, struct ip_vs_service, rcu_head); - ip_vs_service_free(svc); -} - /* * Delete a service from the service list * - The service must be unlinked, unlocked and not referenced! @@ -1376,13 +1366,7 @@ static void __ip_vs_del_service(struct ip_vs_service *svc, bool cleanup) /* * Free the service if nobody refers to it */ - if (atomic_dec_and_test(&svc->refcnt)) { - IP_VS_DBG_BUF(3, "Removing service %u/%s:%u\n", - svc->fwmark, - IP_VS_DBG_ADDR(svc->af, &svc->addr), - ntohs(svc->port)); - call_rcu(&svc->rcu_head, ip_vs_service_rcu_free); - } + __ip_vs_svc_put(svc, true); /* decrease the module use count */ ip_vs_use_count_dec(); -- cgit v1.2.3-58-ga151 From 2f3d771a35fee21a1f17364b46b3c8cc66dc6892 Mon Sep 17 00:00:00 2001 From: Julian Anastasov Date: Thu, 12 Sep 2013 11:21:08 +0300 Subject: ipvs: do not use dest after ip_vs_dest_put in LBLC commit c2a4ffb70eef39 ("ipvs: convert lblc scheduler to rcu") allows RCU readers to use dest after calling ip_vs_dest_put(). In the corner case it can race with ip_vs_dest_trash_expire() which can release the dest while it is being returned to the RCU readers as scheduling result. To fix the problem do not allow en->dest to be replaced and defer the ip_vs_dest_put() call by using RCU callback. Now en->dest does not need to be RCU pointer. Signed-off-by: Julian Anastasov Signed-off-by: Simon Horman --- net/netfilter/ipvs/ip_vs_lblc.c | 68 +++++++++++++++++++---------------------- 1 file changed, 31 insertions(+), 37 deletions(-) (limited to 'net') diff --git a/net/netfilter/ipvs/ip_vs_lblc.c b/net/netfilter/ipvs/ip_vs_lblc.c index eb814bf74e64..eff13c94498e 100644 --- a/net/netfilter/ipvs/ip_vs_lblc.c +++ b/net/netfilter/ipvs/ip_vs_lblc.c @@ -93,7 +93,7 @@ struct ip_vs_lblc_entry { struct hlist_node list; int af; /* address family */ union nf_inet_addr addr; /* destination IP address */ - struct ip_vs_dest __rcu *dest; /* real server (cache) */ + struct ip_vs_dest *dest; /* real server (cache) */ unsigned long lastuse; /* last used time */ struct rcu_head rcu_head; }; @@ -130,20 +130,21 @@ static struct ctl_table vs_vars_table[] = { }; #endif -static inline void ip_vs_lblc_free(struct ip_vs_lblc_entry *en) +static void ip_vs_lblc_rcu_free(struct rcu_head *head) { - struct ip_vs_dest *dest; + struct ip_vs_lblc_entry *en = container_of(head, + struct ip_vs_lblc_entry, + rcu_head); - hlist_del_rcu(&en->list); - /* - * We don't kfree dest because it is referred either by its service - * or the trash dest list. - */ - dest = rcu_dereference_protected(en->dest, 1); - ip_vs_dest_put(dest); - kfree_rcu(en, rcu_head); + ip_vs_dest_put(en->dest); + kfree(en); } +static inline void ip_vs_lblc_del(struct ip_vs_lblc_entry *en) +{ + hlist_del_rcu(&en->list); + call_rcu(&en->rcu_head, ip_vs_lblc_rcu_free); +} /* * Returns hash value for IPVS LBLC entry @@ -203,30 +204,23 @@ ip_vs_lblc_new(struct ip_vs_lblc_table *tbl, const union nf_inet_addr *daddr, struct ip_vs_lblc_entry *en; en = ip_vs_lblc_get(dest->af, tbl, daddr); - if (!en) { - en = kmalloc(sizeof(*en), GFP_ATOMIC); - if (!en) - return NULL; - - en->af = dest->af; - ip_vs_addr_copy(dest->af, &en->addr, daddr); - en->lastuse = jiffies; + if (en) { + if (en->dest == dest) + return en; + ip_vs_lblc_del(en); + } + en = kmalloc(sizeof(*en), GFP_ATOMIC); + if (!en) + return NULL; - ip_vs_dest_hold(dest); - RCU_INIT_POINTER(en->dest, dest); + en->af = dest->af; + ip_vs_addr_copy(dest->af, &en->addr, daddr); + en->lastuse = jiffies; - ip_vs_lblc_hash(tbl, en); - } else { - struct ip_vs_dest *old_dest; + ip_vs_dest_hold(dest); + en->dest = dest; - old_dest = rcu_dereference_protected(en->dest, 1); - if (old_dest != dest) { - ip_vs_dest_put(old_dest); - ip_vs_dest_hold(dest); - /* No ordering constraints for refcnt */ - RCU_INIT_POINTER(en->dest, dest); - } - } + ip_vs_lblc_hash(tbl, en); return en; } @@ -246,7 +240,7 @@ static void ip_vs_lblc_flush(struct ip_vs_service *svc) tbl->dead = 1; for (i=0; ibucket[i], list) { - ip_vs_lblc_free(en); + ip_vs_lblc_del(en); atomic_dec(&tbl->entries); } } @@ -281,7 +275,7 @@ static inline void ip_vs_lblc_full_check(struct ip_vs_service *svc) sysctl_lblc_expiration(svc))) continue; - ip_vs_lblc_free(en); + ip_vs_lblc_del(en); atomic_dec(&tbl->entries); } spin_unlock(&svc->sched_lock); @@ -335,7 +329,7 @@ static void ip_vs_lblc_check_expire(unsigned long data) if (time_before(now, en->lastuse + ENTRY_TIMEOUT)) continue; - ip_vs_lblc_free(en); + ip_vs_lblc_del(en); atomic_dec(&tbl->entries); goal--; } @@ -511,7 +505,7 @@ ip_vs_lblc_schedule(struct ip_vs_service *svc, const struct sk_buff *skb, * free up entries from the trash at any time. */ - dest = rcu_dereference(en->dest); + dest = en->dest; if ((dest->flags & IP_VS_DEST_F_AVAILABLE) && atomic_read(&dest->weight) > 0 && !is_overloaded(dest, svc)) goto out; @@ -631,7 +625,7 @@ static void __exit ip_vs_lblc_cleanup(void) { unregister_ip_vs_scheduler(&ip_vs_lblc_scheduler); unregister_pernet_subsys(&ip_vs_lblc_ops); - synchronize_rcu(); + rcu_barrier(); } -- cgit v1.2.3-58-ga151 From 742617b176909e586a4cf9b142c996c25986fce8 Mon Sep 17 00:00:00 2001 From: Julian Anastasov Date: Thu, 12 Sep 2013 11:21:09 +0300 Subject: ipvs: do not use dest after ip_vs_dest_put in LBLCR commit c5549571f975ab ("ipvs: convert lblcr scheduler to rcu") allows RCU readers to use dest after calling ip_vs_dest_put(). In the corner case it can race with ip_vs_dest_trash_expire() which can release the dest while it is being returned to the RCU readers as scheduling result. To fix the problem do not allow e->dest to be replaced and defer the ip_vs_dest_put() call by using RCU callback. Now e->dest does not need to be RCU pointer. Signed-off-by: Julian Anastasov Signed-off-by: Simon Horman --- net/netfilter/ipvs/ip_vs_lblcr.c | 50 ++++++++++++++++------------------------ 1 file changed, 20 insertions(+), 30 deletions(-) (limited to 'net') diff --git a/net/netfilter/ipvs/ip_vs_lblcr.c b/net/netfilter/ipvs/ip_vs_lblcr.c index e65f7c573090..0b8550089a2e 100644 --- a/net/netfilter/ipvs/ip_vs_lblcr.c +++ b/net/netfilter/ipvs/ip_vs_lblcr.c @@ -89,7 +89,7 @@ */ struct ip_vs_dest_set_elem { struct list_head list; /* list link */ - struct ip_vs_dest __rcu *dest; /* destination server */ + struct ip_vs_dest *dest; /* destination server */ struct rcu_head rcu_head; }; @@ -107,11 +107,7 @@ static void ip_vs_dest_set_insert(struct ip_vs_dest_set *set, if (check) { list_for_each_entry(e, &set->list, list) { - struct ip_vs_dest *d; - - d = rcu_dereference_protected(e->dest, 1); - if (d == dest) - /* already existed */ + if (e->dest == dest) return; } } @@ -121,7 +117,7 @@ static void ip_vs_dest_set_insert(struct ip_vs_dest_set *set, return; ip_vs_dest_hold(dest); - RCU_INIT_POINTER(e->dest, dest); + e->dest = dest; list_add_rcu(&e->list, &set->list); atomic_inc(&set->size); @@ -129,22 +125,27 @@ static void ip_vs_dest_set_insert(struct ip_vs_dest_set *set, set->lastmod = jiffies; } +static void ip_vs_lblcr_elem_rcu_free(struct rcu_head *head) +{ + struct ip_vs_dest_set_elem *e; + + e = container_of(head, struct ip_vs_dest_set_elem, rcu_head); + ip_vs_dest_put(e->dest); + kfree(e); +} + static void ip_vs_dest_set_erase(struct ip_vs_dest_set *set, struct ip_vs_dest *dest) { struct ip_vs_dest_set_elem *e; list_for_each_entry(e, &set->list, list) { - struct ip_vs_dest *d; - - d = rcu_dereference_protected(e->dest, 1); - if (d == dest) { + if (e->dest == dest) { /* HIT */ atomic_dec(&set->size); set->lastmod = jiffies; - ip_vs_dest_put(dest); list_del_rcu(&e->list); - kfree_rcu(e, rcu_head); + call_rcu(&e->rcu_head, ip_vs_lblcr_elem_rcu_free); break; } } @@ -155,16 +156,8 @@ static void ip_vs_dest_set_eraseall(struct ip_vs_dest_set *set) struct ip_vs_dest_set_elem *e, *ep; list_for_each_entry_safe(e, ep, &set->list, list) { - struct ip_vs_dest *d; - - d = rcu_dereference_protected(e->dest, 1); - /* - * We don't kfree dest because it is referred either - * by its service or by the trash dest list. - */ - ip_vs_dest_put(d); list_del_rcu(&e->list); - kfree_rcu(e, rcu_head); + call_rcu(&e->rcu_head, ip_vs_lblcr_elem_rcu_free); } } @@ -175,12 +168,9 @@ static inline struct ip_vs_dest *ip_vs_dest_set_min(struct ip_vs_dest_set *set) struct ip_vs_dest *dest, *least; int loh, doh; - if (set == NULL) - return NULL; - /* select the first destination server, whose weight > 0 */ list_for_each_entry_rcu(e, &set->list, list) { - least = rcu_dereference(e->dest); + least = e->dest; if (least->flags & IP_VS_DEST_F_OVERLOAD) continue; @@ -195,7 +185,7 @@ static inline struct ip_vs_dest *ip_vs_dest_set_min(struct ip_vs_dest_set *set) /* find the destination with the weighted least load */ nextstage: list_for_each_entry_continue_rcu(e, &set->list, list) { - dest = rcu_dereference(e->dest); + dest = e->dest; if (dest->flags & IP_VS_DEST_F_OVERLOAD) continue; @@ -232,7 +222,7 @@ static inline struct ip_vs_dest *ip_vs_dest_set_max(struct ip_vs_dest_set *set) /* select the first destination server, whose weight > 0 */ list_for_each_entry(e, &set->list, list) { - most = rcu_dereference_protected(e->dest, 1); + most = e->dest; if (atomic_read(&most->weight) > 0) { moh = ip_vs_dest_conn_overhead(most); goto nextstage; @@ -243,7 +233,7 @@ static inline struct ip_vs_dest *ip_vs_dest_set_max(struct ip_vs_dest_set *set) /* find the destination with the weighted most load */ nextstage: list_for_each_entry_continue(e, &set->list, list) { - dest = rcu_dereference_protected(e->dest, 1); + dest = e->dest; doh = ip_vs_dest_conn_overhead(dest); /* moh/mw < doh/dw ==> moh*dw < doh*mw, where mw,dw>0 */ if (((__s64)moh * atomic_read(&dest->weight) < @@ -819,7 +809,7 @@ static void __exit ip_vs_lblcr_cleanup(void) { unregister_ip_vs_scheduler(&ip_vs_lblcr_scheduler); unregister_pernet_subsys(&ip_vs_lblcr_ops); - synchronize_rcu(); + rcu_barrier(); } -- cgit v1.2.3-58-ga151 From d1ee4fea0b6946dd8bc61b46db35ea80af7af34b Mon Sep 17 00:00:00 2001 From: Julian Anastasov Date: Thu, 12 Sep 2013 11:21:10 +0300 Subject: ipvs: stats should not depend on CPU 0 When reading percpu stats we need to properly reset the sum when CPU 0 is not present in the possible mask. Signed-off-by: Julian Anastasov Signed-off-by: Simon Horman --- net/netfilter/ipvs/ip_vs_est.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) (limited to 'net') diff --git a/net/netfilter/ipvs/ip_vs_est.c b/net/netfilter/ipvs/ip_vs_est.c index 6bee6d0c73a5..1425e9a924c4 100644 --- a/net/netfilter/ipvs/ip_vs_est.c +++ b/net/netfilter/ipvs/ip_vs_est.c @@ -59,12 +59,13 @@ static void ip_vs_read_cpu_stats(struct ip_vs_stats_user *sum, struct ip_vs_cpu_stats __percpu *stats) { int i; + bool add = false; for_each_possible_cpu(i) { struct ip_vs_cpu_stats *s = per_cpu_ptr(stats, i); unsigned int start; __u64 inbytes, outbytes; - if (i) { + if (add) { sum->conns += s->ustats.conns; sum->inpkts += s->ustats.inpkts; sum->outpkts += s->ustats.outpkts; @@ -76,6 +77,7 @@ static void ip_vs_read_cpu_stats(struct ip_vs_stats_user *sum, sum->inbytes += inbytes; sum->outbytes += outbytes; } else { + add = true; sum->conns = s->ustats.conns; sum->inpkts = s->ustats.inpkts; sum->outpkts = s->ustats.outpkts; -- cgit v1.2.3-58-ga151 From 749154aa56b57652a282cbde57a57abc278d1205 Mon Sep 17 00:00:00 2001 From: Ansis Atteka Date: Wed, 18 Sep 2013 15:29:52 -0700 Subject: ip: use ip_hdr() in __ip_make_skb() to retrieve IP header skb->data already points to IP header, but for the sake of consistency we can also use ip_hdr() to retrieve it. Signed-off-by: Ansis Atteka Signed-off-by: David S. Miller --- net/ipv4/ip_output.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'net') diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index 9ee17e3d11c3..eae2e262fbe5 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -1316,7 +1316,7 @@ struct sk_buff *__ip_make_skb(struct sock *sk, else ttl = ip_select_ttl(inet, &rt->dst); - iph = (struct iphdr *)skb->data; + iph = ip_hdr(skb); iph->version = 4; iph->ihl = 5; iph->tos = inet->tos; -- cgit v1.2.3-58-ga151 From 703133de331a7a7df47f31fb9de51dc6f68a9de8 Mon Sep 17 00:00:00 2001 From: Ansis Atteka Date: Wed, 18 Sep 2013 15:29:53 -0700 Subject: ip: generate unique IP identificator if local fragmentation is allowed If local fragmentation is allowed, then ip_select_ident() and ip_select_ident_more() need to generate unique IDs to ensure correct defragmentation on the peer. For example, if IPsec (tunnel mode) has to encrypt large skbs that have local_df bit set, then all IP fragments that belonged to different ESP datagrams would have used the same identificator. If one of these IP fragments would get lost or reordered, then peer could possibly stitch together wrong IP fragments that did not belong to the same datagram. This would lead to a packet loss or data corruption. Signed-off-by: Ansis Atteka Signed-off-by: David S. Miller --- drivers/net/ppp/pptp.c | 2 +- include/net/ip.h | 12 ++++++++---- net/ipv4/igmp.c | 4 ++-- net/ipv4/inetpeer.c | 4 ++-- net/ipv4/ip_output.c | 6 +++--- net/ipv4/ipmr.c | 2 +- net/ipv4/raw.c | 2 +- net/ipv4/xfrm4_mode_tunnel.c | 2 +- net/netfilter/ipvs/ip_vs_xmit.c | 2 +- 9 files changed, 20 insertions(+), 16 deletions(-) (limited to 'net') diff --git a/drivers/net/ppp/pptp.c b/drivers/net/ppp/pptp.c index 6fa5ae00039f..01805319e1e0 100644 --- a/drivers/net/ppp/pptp.c +++ b/drivers/net/ppp/pptp.c @@ -281,7 +281,7 @@ static int pptp_xmit(struct ppp_channel *chan, struct sk_buff *skb) nf_reset(skb); skb->ip_summed = CHECKSUM_NONE; - ip_select_ident(iph, &rt->dst, NULL); + ip_select_ident(skb, &rt->dst, NULL); ip_send_check(iph); ip_local_out(skb); diff --git a/include/net/ip.h b/include/net/ip.h index 48f55979d842..5e5268807a1c 100644 --- a/include/net/ip.h +++ b/include/net/ip.h @@ -264,9 +264,11 @@ int ip_dont_fragment(struct sock *sk, struct dst_entry *dst) extern void __ip_select_ident(struct iphdr *iph, struct dst_entry *dst, int more); -static inline void ip_select_ident(struct iphdr *iph, struct dst_entry *dst, struct sock *sk) +static inline void ip_select_ident(struct sk_buff *skb, struct dst_entry *dst, struct sock *sk) { - if (iph->frag_off & htons(IP_DF)) { + struct iphdr *iph = ip_hdr(skb); + + if ((iph->frag_off & htons(IP_DF)) && !skb->local_df) { /* This is only to work around buggy Windows95/2000 * VJ compression implementations. If the ID field * does not change, they drop every other packet in @@ -278,9 +280,11 @@ static inline void ip_select_ident(struct iphdr *iph, struct dst_entry *dst, str __ip_select_ident(iph, dst, 0); } -static inline void ip_select_ident_more(struct iphdr *iph, struct dst_entry *dst, struct sock *sk, int more) +static inline void ip_select_ident_more(struct sk_buff *skb, struct dst_entry *dst, struct sock *sk, int more) { - if (iph->frag_off & htons(IP_DF)) { + struct iphdr *iph = ip_hdr(skb); + + if ((iph->frag_off & htons(IP_DF)) && !skb->local_df) { if (sk && inet_sk(sk)->inet_daddr) { iph->id = htons(inet_sk(sk)->inet_id); inet_sk(sk)->inet_id += 1 + more; diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c index d6c0e64ec97f..dace87f06e5f 100644 --- a/net/ipv4/igmp.c +++ b/net/ipv4/igmp.c @@ -369,7 +369,7 @@ static struct sk_buff *igmpv3_newpack(struct net_device *dev, int size) pip->saddr = fl4.saddr; pip->protocol = IPPROTO_IGMP; pip->tot_len = 0; /* filled in later */ - ip_select_ident(pip, &rt->dst, NULL); + ip_select_ident(skb, &rt->dst, NULL); ((u8 *)&pip[1])[0] = IPOPT_RA; ((u8 *)&pip[1])[1] = 4; ((u8 *)&pip[1])[2] = 0; @@ -714,7 +714,7 @@ static int igmp_send_report(struct in_device *in_dev, struct ip_mc_list *pmc, iph->daddr = dst; iph->saddr = fl4.saddr; iph->protocol = IPPROTO_IGMP; - ip_select_ident(iph, &rt->dst, NULL); + ip_select_ident(skb, &rt->dst, NULL); ((u8 *)&iph[1])[0] = IPOPT_RA; ((u8 *)&iph[1])[1] = 4; ((u8 *)&iph[1])[2] = 0; diff --git a/net/ipv4/inetpeer.c b/net/ipv4/inetpeer.c index 000e3d239d64..33d5537881ed 100644 --- a/net/ipv4/inetpeer.c +++ b/net/ipv4/inetpeer.c @@ -32,8 +32,8 @@ * At the moment of writing this notes identifier of IP packets is generated * to be unpredictable using this code only for packets subjected * (actually or potentially) to defragmentation. I.e. DF packets less than - * PMTU in size uses a constant ID and do not use this code (see - * ip_select_ident() in include/net/ip.h). + * PMTU in size when local fragmentation is disabled use a constant ID and do + * not use this code (see ip_select_ident() in include/net/ip.h). * * Route cache entries hold references to our nodes. * New cache entries get references via lookup by destination IP address in diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index eae2e262fbe5..a04d872c54f9 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -148,7 +148,7 @@ int ip_build_and_send_pkt(struct sk_buff *skb, struct sock *sk, iph->daddr = (opt && opt->opt.srr ? opt->opt.faddr : daddr); iph->saddr = saddr; iph->protocol = sk->sk_protocol; - ip_select_ident(iph, &rt->dst, sk); + ip_select_ident(skb, &rt->dst, sk); if (opt && opt->opt.optlen) { iph->ihl += opt->opt.optlen>>2; @@ -386,7 +386,7 @@ packet_routed: ip_options_build(skb, &inet_opt->opt, inet->inet_daddr, rt, 0); } - ip_select_ident_more(iph, &rt->dst, sk, + ip_select_ident_more(skb, &rt->dst, sk, (skb_shinfo(skb)->gso_segs ?: 1) - 1); skb->priority = sk->sk_priority; @@ -1324,7 +1324,7 @@ struct sk_buff *__ip_make_skb(struct sock *sk, iph->ttl = ttl; iph->protocol = sk->sk_protocol; ip_copy_addrs(iph, fl4); - ip_select_ident(iph, &rt->dst, sk); + ip_select_ident(skb, &rt->dst, sk); if (opt) { iph->ihl += opt->optlen>>2; diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c index 9ae54b09254f..62212c772a4b 100644 --- a/net/ipv4/ipmr.c +++ b/net/ipv4/ipmr.c @@ -1658,7 +1658,7 @@ static void ip_encap(struct sk_buff *skb, __be32 saddr, __be32 daddr) iph->protocol = IPPROTO_IPIP; iph->ihl = 5; iph->tot_len = htons(skb->len); - ip_select_ident(iph, skb_dst(skb), NULL); + ip_select_ident(skb, skb_dst(skb), NULL); ip_send_check(iph); memset(&(IPCB(skb)->opt), 0, sizeof(IPCB(skb)->opt)); diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c index a86c7ae71881..bfec521c717f 100644 --- a/net/ipv4/raw.c +++ b/net/ipv4/raw.c @@ -387,7 +387,7 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4, iph->check = 0; iph->tot_len = htons(length); if (!iph->id) - ip_select_ident(iph, &rt->dst, NULL); + ip_select_ident(skb, &rt->dst, NULL); iph->check = ip_fast_csum((unsigned char *)iph, iph->ihl); } diff --git a/net/ipv4/xfrm4_mode_tunnel.c b/net/ipv4/xfrm4_mode_tunnel.c index eb1dd4d643f2..b5663c37f089 100644 --- a/net/ipv4/xfrm4_mode_tunnel.c +++ b/net/ipv4/xfrm4_mode_tunnel.c @@ -117,7 +117,7 @@ static int xfrm4_mode_tunnel_output(struct xfrm_state *x, struct sk_buff *skb) top_iph->frag_off = (flags & XFRM_STATE_NOPMTUDISC) ? 0 : (XFRM_MODE_SKB_CB(skb)->frag_off & htons(IP_DF)); - ip_select_ident(top_iph, dst->child, NULL); + ip_select_ident(skb, dst->child, NULL); top_iph->ttl = ip4_dst_hoplimit(dst->child); diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c index b75ff6429a04..c47444e4cf8c 100644 --- a/net/netfilter/ipvs/ip_vs_xmit.c +++ b/net/netfilter/ipvs/ip_vs_xmit.c @@ -883,7 +883,7 @@ ip_vs_tunnel_xmit(struct sk_buff *skb, struct ip_vs_conn *cp, iph->daddr = cp->daddr.ip; iph->saddr = saddr; iph->ttl = old_iph->ttl; - ip_select_ident(iph, &rt->dst, NULL); + ip_select_ident(skb, &rt->dst, NULL); /* Another hack: avoid icmp_send in ip_fragment */ skb->local_df = 1; -- cgit v1.2.3-58-ga151 From d0fe8c888b1fd1a2f84b9962cabcb98a70988aec Mon Sep 17 00:00:00 2001 From: Nikolay Aleksandrov Date: Thu, 19 Sep 2013 15:02:35 +0200 Subject: netpoll: fix NULL pointer dereference in netpoll_cleanup I've been hitting a NULL ptr deref while using netconsole because the np->dev check and the pointer manipulation in netpoll_cleanup are done without rtnl and the following sequence happens when having a netconsole over a vlan and we remove the vlan while disabling the netconsole: CPU 1 CPU2 removes vlan and calls the notifier enters store_enabled(), calls netdev_cleanup which checks np->dev and then waits for rtnl executes the netconsole netdev release notifier making np->dev == NULL and releases rtnl continues to dereference a member of np->dev which at this point is == NULL Signed-off-by: Nikolay Aleksandrov Signed-off-by: David S. Miller --- net/core/netpoll.c | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) (limited to 'net') diff --git a/net/core/netpoll.c b/net/core/netpoll.c index c3c7b27c112d..fc75c9e461b8 100644 --- a/net/core/netpoll.c +++ b/net/core/netpoll.c @@ -1284,15 +1284,14 @@ EXPORT_SYMBOL_GPL(__netpoll_free_async); void netpoll_cleanup(struct netpoll *np) { - if (!np->dev) - return; - rtnl_lock(); + if (!np->dev) + goto out; __netpoll_cleanup(np); - rtnl_unlock(); - dev_put(np->dev); np->dev = NULL; +out: + rtnl_unlock(); } EXPORT_SYMBOL(netpoll_cleanup); -- cgit v1.2.3-58-ga151 From 29cd718beba999bda4bdbbf59b5a4d25c07e1547 Mon Sep 17 00:00:00 2001 From: Gianluca Anzolin Date: Tue, 27 Aug 2013 18:28:46 +0200 Subject: Bluetooth: don't release the port in rfcomm_dev_state_change() When the dlc is closed, rfcomm_dev_state_change() tries to release the port in the case it cannot get a reference to the tty. However this is racy and not even needed. Infact as Peter Hurley points out: 1. Only consider dlcs that are 'stolen' from a connected socket, ie. reused. Allocated dlcs cannot have been closed prior to port activate and so for these dlcs a tty reference will always be avail in rfcomm_dev_state_change() -- except for the conditions covered by #2b below. 2. If a tty was at some point previously created for this rfcomm, then either (a) the tty reference is still avail, so rfcomm_dev_state_change() will perform a hangup. So nothing to do, or, (b) the tty reference is no longer avail, and the tty_port will be destroyed by the last tty_port_put() in rfcomm_tty_cleanup. Again, no action required. 3. Prior to obtaining the dlc lock in rfcomm_dev_add(), rfcomm_dev_state_change() will not 'see' a rfcomm_dev so nothing to do here. 4. After releasing the dlc lock in rfcomm_dev_add(), rfcomm_dev_state_change() will 'see' an incomplete rfcomm_dev if a tty reference could not be obtained. Again, the best thing to do here is nothing. Any future attempted open() will block on rfcomm_dev_carrier_raised(). The unconnected device will exist until released by ioctl(RFCOMMRELEASEDEV). The patch removes the aforementioned code and uses the tty_port_tty_hangup() helper to hangup the tty. Signed-off-by: Gianluca Anzolin Reviewed-by: Peter Hurley Signed-off-by: Gustavo Padovan --- net/bluetooth/rfcomm/tty.c | 35 ++--------------------------------- 1 file changed, 2 insertions(+), 33 deletions(-) (limited to 'net') diff --git a/net/bluetooth/rfcomm/tty.c b/net/bluetooth/rfcomm/tty.c index 6d126faf145f..84fcf9fff3ea 100644 --- a/net/bluetooth/rfcomm/tty.c +++ b/net/bluetooth/rfcomm/tty.c @@ -569,7 +569,6 @@ static void rfcomm_dev_data_ready(struct rfcomm_dlc *dlc, struct sk_buff *skb) static void rfcomm_dev_state_change(struct rfcomm_dlc *dlc, int err) { struct rfcomm_dev *dev = dlc->owner; - struct tty_struct *tty; if (!dev) return; @@ -581,38 +580,8 @@ static void rfcomm_dev_state_change(struct rfcomm_dlc *dlc, int err) DPM_ORDER_DEV_AFTER_PARENT); wake_up_interruptible(&dev->port.open_wait); - } else if (dlc->state == BT_CLOSED) { - tty = tty_port_tty_get(&dev->port); - if (!tty) { - if (test_bit(RFCOMM_RELEASE_ONHUP, &dev->flags)) { - /* Drop DLC lock here to avoid deadlock - * 1. rfcomm_dev_get will take rfcomm_dev_lock - * but in rfcomm_dev_add there's lock order: - * rfcomm_dev_lock -> dlc lock - * 2. tty_port_put will deadlock if it's - * the last reference - * - * FIXME: when we release the lock anything - * could happen to dev, even its destruction - */ - rfcomm_dlc_unlock(dlc); - if (rfcomm_dev_get(dev->id) == NULL) { - rfcomm_dlc_lock(dlc); - return; - } - - if (!test_and_set_bit(RFCOMM_TTY_RELEASED, - &dev->flags)) - tty_port_put(&dev->port); - - tty_port_put(&dev->port); - rfcomm_dlc_lock(dlc); - } - } else { - tty_hangup(tty); - tty_kref_put(tty); - } - } + } else if (dlc->state == BT_CLOSED) + tty_port_tty_hangup(&dev->port, false); } static void rfcomm_dev_modem_status(struct rfcomm_dlc *dlc, u8 v24_sig) -- cgit v1.2.3-58-ga151 From a224bd36bf5ccc72d0f12ab11216706762133177 Mon Sep 17 00:00:00 2001 From: "josselin.costanzi@mobile-devices.fr" Date: Wed, 18 Sep 2013 12:00:35 +0200 Subject: net/lapb: re-send packets on timeout Actually re-send packets when the T1 timer runs out. This fixes a bug where packets are waiting on the write queue until disconnection when no other traffic is outstanding. Signed-off-by: Josselin Costanzi Signed-off-by: Maxime Jayat Signed-off-by: David S. Miller --- net/lapb/lapb_timer.c | 1 + 1 file changed, 1 insertion(+) (limited to 'net') diff --git a/net/lapb/lapb_timer.c b/net/lapb/lapb_timer.c index 54563ad8aeb1..355cc3b6fa4d 100644 --- a/net/lapb/lapb_timer.c +++ b/net/lapb/lapb_timer.c @@ -154,6 +154,7 @@ static void lapb_t1timer_expiry(unsigned long param) } else { lapb->n2count++; lapb_requeue_frames(lapb); + lapb_kick(lapb); } break; -- cgit v1.2.3-58-ga151 From 9fe34f5d920b183ec063550e0f4ec854aa373316 Mon Sep 17 00:00:00 2001 From: Noel Burton-Krahn Date: Wed, 18 Sep 2013 12:24:40 -0700 Subject: mrp: add periodictimer to allow retries when packets get lost MRP doesn't implement the periodictimer in 802.1Q, so it never retries if packets get lost. I ran into this problem when MRP sent a MVRP JoinIn before the interface was fully up. The JoinIn was lost, MRP didn't retry, and MVRP registration failed. Tested against Juniper QFabric switches Signed-off-by: Noel Burton-Krahn Acked-by: David Ward Signed-off-by: David S. Miller --- include/net/mrp.h | 1 + net/802/mrp.c | 27 +++++++++++++++++++++++++++ 2 files changed, 28 insertions(+) (limited to 'net') diff --git a/include/net/mrp.h b/include/net/mrp.h index 4fbf02aa2ec1..0f7558b638ae 100644 --- a/include/net/mrp.h +++ b/include/net/mrp.h @@ -112,6 +112,7 @@ struct mrp_applicant { struct mrp_application *app; struct net_device *dev; struct timer_list join_timer; + struct timer_list periodic_timer; spinlock_t lock; struct sk_buff_head queue; diff --git a/net/802/mrp.c b/net/802/mrp.c index 1eb05d80b07b..3ed616215870 100644 --- a/net/802/mrp.c +++ b/net/802/mrp.c @@ -24,6 +24,11 @@ static unsigned int mrp_join_time __read_mostly = 200; module_param(mrp_join_time, uint, 0644); MODULE_PARM_DESC(mrp_join_time, "Join time in ms (default 200ms)"); + +static unsigned int mrp_periodic_time __read_mostly = 1000; +module_param(mrp_periodic_time, uint, 0644); +MODULE_PARM_DESC(mrp_periodic_time, "Periodic time in ms (default 1s)"); + MODULE_LICENSE("GPL"); static const u8 @@ -595,6 +600,24 @@ static void mrp_join_timer(unsigned long data) mrp_join_timer_arm(app); } +static void mrp_periodic_timer_arm(struct mrp_applicant *app) +{ + mod_timer(&app->periodic_timer, + jiffies + msecs_to_jiffies(mrp_periodic_time)); +} + +static void mrp_periodic_timer(unsigned long data) +{ + struct mrp_applicant *app = (struct mrp_applicant *)data; + + spin_lock(&app->lock); + mrp_mad_event(app, MRP_EVENT_PERIODIC); + mrp_pdu_queue(app); + spin_unlock(&app->lock); + + mrp_periodic_timer_arm(app); +} + static int mrp_pdu_parse_end_mark(struct sk_buff *skb, int *offset) { __be16 endmark; @@ -845,6 +868,9 @@ int mrp_init_applicant(struct net_device *dev, struct mrp_application *appl) rcu_assign_pointer(dev->mrp_port->applicants[appl->type], app); setup_timer(&app->join_timer, mrp_join_timer, (unsigned long)app); mrp_join_timer_arm(app); + setup_timer(&app->periodic_timer, mrp_periodic_timer, + (unsigned long)app); + mrp_periodic_timer_arm(app); return 0; err3: @@ -870,6 +896,7 @@ void mrp_uninit_applicant(struct net_device *dev, struct mrp_application *appl) * all pending messages before the applicant is gone. */ del_timer_sync(&app->join_timer); + del_timer_sync(&app->periodic_timer); spin_lock_bh(&app->lock); mrp_mad_event(app, MRP_EVENT_TX); -- cgit v1.2.3-58-ga151 From 1a462d189280b560bd84af1407e4d848e262b3b3 Mon Sep 17 00:00:00 2001 From: Duan Jiong Date: Fri, 20 Sep 2013 18:20:28 +0800 Subject: net: udp: do not report ICMP redirects to user space Redirect isn't an error condition, it should leave the error handler without touching the socket. Signed-off-by: Duan Jiong Signed-off-by: David S. Miller --- net/ipv4/udp.c | 2 +- net/ipv6/udp.c | 4 +++- 2 files changed, 4 insertions(+), 2 deletions(-) (limited to 'net') diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 74d2c95db57f..0ca44df51ee9 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -658,7 +658,7 @@ void __udp4_lib_err(struct sk_buff *skb, u32 info, struct udp_table *udptable) break; case ICMP_REDIRECT: ipv4_sk_redirect(skb, sk); - break; + goto out; } /* diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c index f4058150262b..72b7eaaf3ca0 100644 --- a/net/ipv6/udp.c +++ b/net/ipv6/udp.c @@ -525,8 +525,10 @@ void __udp6_lib_err(struct sk_buff *skb, struct inet6_skb_parm *opt, if (type == ICMPV6_PKT_TOOBIG) ip6_sk_update_pmtu(skb, sk, info); - if (type == NDISC_REDIRECT) + if (type == NDISC_REDIRECT) { ip6_sk_redirect(skb, sk); + goto out; + } np = inet6_sk(sk); -- cgit v1.2.3-58-ga151 From 8d65b1190ddc548b0411477f308d04f4595bac57 Mon Sep 17 00:00:00 2001 From: Duan Jiong Date: Fri, 20 Sep 2013 18:21:25 +0800 Subject: net: raw: do not report ICMP redirects to user space Redirect isn't an error condition, it should leave the error handler without touching the socket. Signed-off-by: Duan Jiong Signed-off-by: David S. Miller --- net/ipv4/raw.c | 4 +++- net/ipv6/raw.c | 4 +++- 2 files changed, 6 insertions(+), 2 deletions(-) (limited to 'net') diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c index bfec521c717f..193db03540ad 100644 --- a/net/ipv4/raw.c +++ b/net/ipv4/raw.c @@ -218,8 +218,10 @@ static void raw_err(struct sock *sk, struct sk_buff *skb, u32 info) if (type == ICMP_DEST_UNREACH && code == ICMP_FRAG_NEEDED) ipv4_sk_update_pmtu(skb, sk, info); - else if (type == ICMP_REDIRECT) + else if (type == ICMP_REDIRECT) { ipv4_sk_redirect(skb, sk); + return; + } /* Report error on raw socket, if: 1. User requested ip_recverr. diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c index 58916bbb1728..a4ed2416399e 100644 --- a/net/ipv6/raw.c +++ b/net/ipv6/raw.c @@ -335,8 +335,10 @@ static void rawv6_err(struct sock *sk, struct sk_buff *skb, ip6_sk_update_pmtu(skb, sk, info); harderr = (np->pmtudisc == IPV6_PMTUDISC_DO); } - if (type == NDISC_REDIRECT) + if (type == NDISC_REDIRECT) { ip6_sk_redirect(skb, sk); + return; + } if (np->recverr) { u8 *payload = skb->data; if (!inet->hdrincl) -- cgit v1.2.3-58-ga151 From 2811ebac2521ceac84f2bdae402455baa6a7fb47 Mon Sep 17 00:00:00 2001 From: Hannes Frederic Sowa Date: Sat, 21 Sep 2013 06:27:00 +0200 Subject: ipv6: udp packets following an UFO enqueued packet need also be handled by UFO In the following scenario the socket is corked: If the first UDP packet is larger then the mtu we try to append it to the write queue via ip6_ufo_append_data. A following packet, which is smaller than the mtu would be appended to the already queued up gso-skb via plain ip6_append_data. This causes random memory corruptions. In ip6_ufo_append_data we also have to be careful to not queue up the same skb multiple times. So setup the gso frame only when no first skb is available. This also fixes a shortcoming where we add the current packet's length to cork->length but return early because of a packet > mtu with dontfrag set (instead of sutracting it again). Found with trinity. Cc: YOSHIFUJI Hideaki Signed-off-by: Hannes Frederic Sowa Reported-by: Dmitry Vyukov Signed-off-by: David S. Miller --- net/ipv6/ip6_output.c | 53 +++++++++++++++++++++------------------------------ 1 file changed, 22 insertions(+), 31 deletions(-) (limited to 'net') diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index 3a692d529163..a54c45ce4a48 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -1015,6 +1015,8 @@ static inline int ip6_ufo_append_data(struct sock *sk, * udp datagram */ if ((skb = skb_peek_tail(&sk->sk_write_queue)) == NULL) { + struct frag_hdr fhdr; + skb = sock_alloc_send_skb(sk, hh_len + fragheaderlen + transhdrlen + 20, (flags & MSG_DONTWAIT), &err); @@ -1036,12 +1038,6 @@ static inline int ip6_ufo_append_data(struct sock *sk, skb->protocol = htons(ETH_P_IPV6); skb->ip_summed = CHECKSUM_PARTIAL; skb->csum = 0; - } - - err = skb_append_datato_frags(sk,skb, getfrag, from, - (length - transhdrlen)); - if (!err) { - struct frag_hdr fhdr; /* Specify the length of each IPv6 datagram fragment. * It has to be a multiple of 8. @@ -1052,15 +1048,10 @@ static inline int ip6_ufo_append_data(struct sock *sk, ipv6_select_ident(&fhdr, rt); skb_shinfo(skb)->ip6_frag_id = fhdr.identification; __skb_queue_tail(&sk->sk_write_queue, skb); - - return 0; } - /* There is not enough support do UPD LSO, - * so follow normal path - */ - kfree_skb(skb); - return err; + return skb_append_datato_frags(sk, skb, getfrag, from, + (length - transhdrlen)); } static inline struct ipv6_opt_hdr *ip6_opt_dup(struct ipv6_opt_hdr *src, @@ -1227,27 +1218,27 @@ int ip6_append_data(struct sock *sk, int getfrag(void *from, char *to, * --yoshfuji */ - cork->length += length; - if (length > mtu) { - int proto = sk->sk_protocol; - if (dontfrag && (proto == IPPROTO_UDP || proto == IPPROTO_RAW)){ - ipv6_local_rxpmtu(sk, fl6, mtu-exthdrlen); - return -EMSGSIZE; - } - - if (proto == IPPROTO_UDP && - (rt->dst.dev->features & NETIF_F_UFO)) { + if ((length > mtu) && dontfrag && (sk->sk_protocol == IPPROTO_UDP || + sk->sk_protocol == IPPROTO_RAW)) { + ipv6_local_rxpmtu(sk, fl6, mtu-exthdrlen); + return -EMSGSIZE; + } - err = ip6_ufo_append_data(sk, getfrag, from, length, - hh_len, fragheaderlen, - transhdrlen, mtu, flags, rt); - if (err) - goto error; - return 0; - } + skb = skb_peek_tail(&sk->sk_write_queue); + cork->length += length; + if (((length > mtu) || + (skb && skb_is_gso(skb))) && + (sk->sk_protocol == IPPROTO_UDP) && + (rt->dst.dev->features & NETIF_F_UFO)) { + err = ip6_ufo_append_data(sk, getfrag, from, length, + hh_len, fragheaderlen, + transhdrlen, mtu, flags, rt); + if (err) + goto error; + return 0; } - if ((skb = skb_peek_tail(&sk->sk_write_queue)) == NULL) + if (!skb) goto alloc_new_skb; while (length > 0) { -- cgit v1.2.3-58-ga151 From 7df37ff33dc122f7bd0614d707939fe84322d264 Mon Sep 17 00:00:00 2001 From: "Catalin\\(ux\\) M. BOIE" Date: Mon, 23 Sep 2013 23:04:19 +0300 Subject: IPv6 NAT: Do not drop DNATed 6to4/6rd packets When a router is doing DNAT for 6to4/6rd packets the latest anti-spoofing commit 218774dc ("ipv6: add anti-spoofing checks for 6to4 and 6rd") will drop them because the IPv6 address embedded does not match the IPv4 destination. This patch will allow them to pass by testing if we have an address that matches on 6to4/6rd interface. I have been hit by this problem using Fedora and IPV6TO4_IPV4ADDR. Also, log the dropped packets (with rate limit). Signed-off-by: Catalin(ux) M. BOIE Acked-by: Hannes Frederic Sowa Signed-off-by: David S. Miller --- include/net/addrconf.h | 4 +++ net/ipv6/addrconf.c | 27 ++++++++++++++++ net/ipv6/sit.c | 84 +++++++++++++++++++++++++++++++++++++++++--------- 3 files changed, 100 insertions(+), 15 deletions(-) (limited to 'net') diff --git a/include/net/addrconf.h b/include/net/addrconf.h index fb314de2b61b..86505bfa5d2c 100644 --- a/include/net/addrconf.h +++ b/include/net/addrconf.h @@ -67,6 +67,10 @@ int ipv6_chk_addr(struct net *net, const struct in6_addr *addr, int ipv6_chk_home_addr(struct net *net, const struct in6_addr *addr); #endif +bool ipv6_chk_custom_prefix(const struct in6_addr *addr, + const unsigned int prefix_len, + struct net_device *dev); + int ipv6_chk_prefix(const struct in6_addr *addr, struct net_device *dev); struct inet6_ifaddr *ipv6_get_ifaddr(struct net *net, diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index d6ff12617f36..a0c3abe72461 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -1499,6 +1499,33 @@ static bool ipv6_chk_same_addr(struct net *net, const struct in6_addr *addr, return false; } +/* Compares an address/prefix_len with addresses on device @dev. + * If one is found it returns true. + */ +bool ipv6_chk_custom_prefix(const struct in6_addr *addr, + const unsigned int prefix_len, struct net_device *dev) +{ + struct inet6_dev *idev; + struct inet6_ifaddr *ifa; + bool ret = false; + + rcu_read_lock(); + idev = __in6_dev_get(dev); + if (idev) { + read_lock_bh(&idev->lock); + list_for_each_entry(ifa, &idev->addr_list, if_list) { + ret = ipv6_prefix_equal(addr, &ifa->addr, prefix_len); + if (ret) + break; + } + read_unlock_bh(&idev->lock); + } + rcu_read_unlock(); + + return ret; +} +EXPORT_SYMBOL(ipv6_chk_custom_prefix); + int ipv6_chk_prefix(const struct in6_addr *addr, struct net_device *dev) { struct inet6_dev *idev; diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c index 7ee5cb96db34..afd5605aea7c 100644 --- a/net/ipv6/sit.c +++ b/net/ipv6/sit.c @@ -566,6 +566,70 @@ static inline bool is_spoofed_6rd(struct ip_tunnel *tunnel, const __be32 v4addr, return false; } +/* Checks if an address matches an address on the tunnel interface. + * Used to detect the NAT of proto 41 packets and let them pass spoofing test. + * Long story: + * This function is called after we considered the packet as spoofed + * in is_spoofed_6rd. + * We may have a router that is doing NAT for proto 41 packets + * for an internal station. Destination a.a.a.a/PREFIX:bbbb:bbbb + * will be translated to n.n.n.n/PREFIX:bbbb:bbbb. And is_spoofed_6rd + * function will return true, dropping the packet. + * But, we can still check if is spoofed against the IP + * addresses associated with the interface. + */ +static bool only_dnatted(const struct ip_tunnel *tunnel, + const struct in6_addr *v6dst) +{ + int prefix_len; + +#ifdef CONFIG_IPV6_SIT_6RD + prefix_len = tunnel->ip6rd.prefixlen + 32 + - tunnel->ip6rd.relay_prefixlen; +#else + prefix_len = 48; +#endif + return ipv6_chk_custom_prefix(v6dst, prefix_len, tunnel->dev); +} + +/* Returns true if a packet is spoofed */ +static bool packet_is_spoofed(struct sk_buff *skb, + const struct iphdr *iph, + struct ip_tunnel *tunnel) +{ + const struct ipv6hdr *ipv6h; + + if (tunnel->dev->priv_flags & IFF_ISATAP) { + if (!isatap_chksrc(skb, iph, tunnel)) + return true; + + return false; + } + + if (tunnel->dev->flags & IFF_POINTOPOINT) + return false; + + ipv6h = ipv6_hdr(skb); + + if (unlikely(is_spoofed_6rd(tunnel, iph->saddr, &ipv6h->saddr))) { + net_warn_ratelimited("Src spoofed %pI4/%pI6c -> %pI4/%pI6c\n", + &iph->saddr, &ipv6h->saddr, + &iph->daddr, &ipv6h->daddr); + return true; + } + + if (likely(!is_spoofed_6rd(tunnel, iph->daddr, &ipv6h->daddr))) + return false; + + if (only_dnatted(tunnel, &ipv6h->daddr)) + return false; + + net_warn_ratelimited("Dst spoofed %pI4/%pI6c -> %pI4/%pI6c\n", + &iph->saddr, &ipv6h->saddr, + &iph->daddr, &ipv6h->daddr); + return true; +} + static int ipip6_rcv(struct sk_buff *skb) { const struct iphdr *iph = ip_hdr(skb); @@ -586,19 +650,9 @@ static int ipip6_rcv(struct sk_buff *skb) IPCB(skb)->flags = 0; skb->protocol = htons(ETH_P_IPV6); - if (tunnel->dev->priv_flags & IFF_ISATAP) { - if (!isatap_chksrc(skb, iph, tunnel)) { - tunnel->dev->stats.rx_errors++; - goto out; - } - } else if (!(tunnel->dev->flags&IFF_POINTOPOINT)) { - if (is_spoofed_6rd(tunnel, iph->saddr, - &ipv6_hdr(skb)->saddr) || - is_spoofed_6rd(tunnel, iph->daddr, - &ipv6_hdr(skb)->daddr)) { - tunnel->dev->stats.rx_errors++; - goto out; - } + if (packet_is_spoofed(skb, iph, tunnel)) { + tunnel->dev->stats.rx_errors++; + goto out; } __skb_tunnel_rx(skb, tunnel->dev, tunnel->net); @@ -748,7 +802,7 @@ static netdev_tx_t ipip6_tunnel_xmit(struct sk_buff *skb, neigh = dst_neigh_lookup(skb_dst(skb), &iph6->daddr); if (neigh == NULL) { - net_dbg_ratelimited("sit: nexthop == NULL\n"); + net_dbg_ratelimited("nexthop == NULL\n"); goto tx_error; } @@ -777,7 +831,7 @@ static netdev_tx_t ipip6_tunnel_xmit(struct sk_buff *skb, neigh = dst_neigh_lookup(skb_dst(skb), &iph6->daddr); if (neigh == NULL) { - net_dbg_ratelimited("sit: nexthop == NULL\n"); + net_dbg_ratelimited("nexthop == NULL\n"); goto tx_error; } -- cgit v1.2.3-58-ga151 From 50624c934db18ab90aaea4908f60dd39aab4e6e5 Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Mon, 23 Sep 2013 21:19:49 -0700 Subject: net: Delay default_device_exit_batch until no devices are unregistering v2 There is currently serialization network namespaces exiting and network devices exiting as the final part of netdev_run_todo does not happen under the rtnl_lock. This is compounded by the fact that the only list of devices unregistering in netdev_run_todo is local to the netdev_run_todo. This lack of serialization in extreme cases results in network devices unregistering in netdev_run_todo after the loopback device of their network namespace has been freed (making dst_ifdown unsafe), and after the their network namespace has exited (making the NETDEV_UNREGISTER, and NETDEV_UNREGISTER_FINAL callbacks unsafe). Add the missing serialization by a per network namespace count of how many network devices are unregistering and having a wait queue that is woken up whenever the count is decreased. The count and wait queue allow default_device_exit_batch to wait until all of the unregistration activity for a network namespace has finished before proceeding to unregister the loopback device and then allowing the network namespace to exit. Only a single global wait queue is used because there is a single global lock, and there is a single waiter, per network namespace wait queues would be a waste of resources. The per network namespace count of unregistering devices gives a progress guarantee because the number of network devices unregistering in an exiting network namespace must ultimately drop to zero (assuming network device unregistration completes). The basic logic remains the same as in v1. This patch is now half comment and half rtnl_lock_unregistering an expanded version of wait_event performs no extra work in the common case where no network devices are unregistering when we get to default_device_exit_batch. Reported-by: Francesco Ruggeri Signed-off-by: "Eric W. Biederman" Signed-off-by: David S. Miller --- include/net/net_namespace.h | 1 + net/core/dev.c | 49 ++++++++++++++++++++++++++++++++++++++++++++- 2 files changed, 49 insertions(+), 1 deletion(-) (limited to 'net') diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h index 1313456a0994..9d22f08896c6 100644 --- a/include/net/net_namespace.h +++ b/include/net/net_namespace.h @@ -74,6 +74,7 @@ struct net { struct hlist_head *dev_index_head; unsigned int dev_base_seq; /* protected by rtnl_mutex */ int ifindex; + unsigned int dev_unreg_count; /* core fib_rules */ struct list_head rules_ops; diff --git a/net/core/dev.c b/net/core/dev.c index 5c713f2239cc..65f829cfd928 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -5247,10 +5247,12 @@ static int dev_new_index(struct net *net) /* Delayed registration/unregisteration */ static LIST_HEAD(net_todo_list); +static DECLARE_WAIT_QUEUE_HEAD(netdev_unregistering_wq); static void net_set_todo(struct net_device *dev) { list_add_tail(&dev->todo_list, &net_todo_list); + dev_net(dev)->dev_unreg_count++; } static void rollback_registered_many(struct list_head *head) @@ -5918,6 +5920,12 @@ void netdev_run_todo(void) if (dev->destructor) dev->destructor(dev); + /* Report a network device has been unregistered */ + rtnl_lock(); + dev_net(dev)->dev_unreg_count--; + __rtnl_unlock(); + wake_up(&netdev_unregistering_wq); + /* Free network device */ kobject_put(&dev->dev.kobj); } @@ -6603,6 +6611,34 @@ static void __net_exit default_device_exit(struct net *net) rtnl_unlock(); } +static void __net_exit rtnl_lock_unregistering(struct list_head *net_list) +{ + /* Return with the rtnl_lock held when there are no network + * devices unregistering in any network namespace in net_list. + */ + struct net *net; + bool unregistering; + DEFINE_WAIT(wait); + + for (;;) { + prepare_to_wait(&netdev_unregistering_wq, &wait, + TASK_UNINTERRUPTIBLE); + unregistering = false; + rtnl_lock(); + list_for_each_entry(net, net_list, exit_list) { + if (net->dev_unreg_count > 0) { + unregistering = true; + break; + } + } + if (!unregistering) + break; + __rtnl_unlock(); + schedule(); + } + finish_wait(&netdev_unregistering_wq, &wait); +} + static void __net_exit default_device_exit_batch(struct list_head *net_list) { /* At exit all network devices most be removed from a network @@ -6614,7 +6650,18 @@ static void __net_exit default_device_exit_batch(struct list_head *net_list) struct net *net; LIST_HEAD(dev_kill_list); - rtnl_lock(); + /* To prevent network device cleanup code from dereferencing + * loopback devices or network devices that have been freed + * wait here for all pending unregistrations to complete, + * before unregistring the loopback device and allowing the + * network namespace be freed. + * + * The netdev todo list containing all network devices + * unregistrations that happen in default_device_exit_batch + * will run in the rtnl_unlock() at the end of + * default_device_exit_batch. + */ + rtnl_lock_unregistering(net_list); list_for_each_entry(net, net_list, exit_list) { for_each_netdev_reverse(net, dev) { if (dev->rtnl_link_ops) -- cgit v1.2.3-58-ga151 From 9a3bab6b05383f1e4c3716b3615500c51285959e Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Tue, 24 Sep 2013 06:19:57 -0700 Subject: net: net_secret should not depend on TCP A host might need net_secret[] and never open a single socket. Problem added in commit aebda156a570782 ("net: defer net_secret[] initialization") Based on prior patch from Hannes Frederic Sowa. Reported-by: Hannes Frederic Sowa Signed-off-by: Eric Dumazet Acked-by: Hannes Frederic Sowa Signed-off-by: David S. Miller --- include/net/secure_seq.h | 1 - net/core/secure_seq.c | 27 ++++++++++++++++++++++++--- net/ipv4/af_inet.c | 4 +--- 3 files changed, 25 insertions(+), 7 deletions(-) (limited to 'net') diff --git a/include/net/secure_seq.h b/include/net/secure_seq.h index 6ca975bebd37..c2e542b27a5a 100644 --- a/include/net/secure_seq.h +++ b/include/net/secure_seq.h @@ -3,7 +3,6 @@ #include -extern void net_secret_init(void); extern __u32 secure_ip_id(__be32 daddr); extern __u32 secure_ipv6_id(const __be32 daddr[4]); extern u32 secure_ipv4_port_ephemeral(__be32 saddr, __be32 daddr, __be16 dport); diff --git a/net/core/secure_seq.c b/net/core/secure_seq.c index 6a2f13cee86a..3f1ec1586ae1 100644 --- a/net/core/secure_seq.c +++ b/net/core/secure_seq.c @@ -10,11 +10,24 @@ #include -static u32 net_secret[MD5_MESSAGE_BYTES / 4] ____cacheline_aligned; +#define NET_SECRET_SIZE (MD5_MESSAGE_BYTES / 4) -void net_secret_init(void) +static u32 net_secret[NET_SECRET_SIZE] ____cacheline_aligned; + +static void net_secret_init(void) { - get_random_bytes(net_secret, sizeof(net_secret)); + u32 tmp; + int i; + + if (likely(net_secret[0])) + return; + + for (i = NET_SECRET_SIZE; i > 0;) { + do { + get_random_bytes(&tmp, sizeof(tmp)); + } while (!tmp); + cmpxchg(&net_secret[--i], 0, tmp); + } } #ifdef CONFIG_INET @@ -42,6 +55,7 @@ __u32 secure_tcpv6_sequence_number(const __be32 *saddr, const __be32 *daddr, u32 hash[MD5_DIGEST_WORDS]; u32 i; + net_secret_init(); memcpy(hash, saddr, 16); for (i = 0; i < 4; i++) secret[i] = net_secret[i] + (__force u32)daddr[i]; @@ -63,6 +77,7 @@ u32 secure_ipv6_port_ephemeral(const __be32 *saddr, const __be32 *daddr, u32 hash[MD5_DIGEST_WORDS]; u32 i; + net_secret_init(); memcpy(hash, saddr, 16); for (i = 0; i < 4; i++) secret[i] = net_secret[i] + (__force u32) daddr[i]; @@ -82,6 +97,7 @@ __u32 secure_ip_id(__be32 daddr) { u32 hash[MD5_DIGEST_WORDS]; + net_secret_init(); hash[0] = (__force __u32) daddr; hash[1] = net_secret[13]; hash[2] = net_secret[14]; @@ -96,6 +112,7 @@ __u32 secure_ipv6_id(const __be32 daddr[4]) { __u32 hash[4]; + net_secret_init(); memcpy(hash, daddr, 16); md5_transform(hash, net_secret); @@ -107,6 +124,7 @@ __u32 secure_tcp_sequence_number(__be32 saddr, __be32 daddr, { u32 hash[MD5_DIGEST_WORDS]; + net_secret_init(); hash[0] = (__force u32)saddr; hash[1] = (__force u32)daddr; hash[2] = ((__force u16)sport << 16) + (__force u16)dport; @@ -121,6 +139,7 @@ u32 secure_ipv4_port_ephemeral(__be32 saddr, __be32 daddr, __be16 dport) { u32 hash[MD5_DIGEST_WORDS]; + net_secret_init(); hash[0] = (__force u32)saddr; hash[1] = (__force u32)daddr; hash[2] = (__force u32)dport ^ net_secret[14]; @@ -140,6 +159,7 @@ u64 secure_dccp_sequence_number(__be32 saddr, __be32 daddr, u32 hash[MD5_DIGEST_WORDS]; u64 seq; + net_secret_init(); hash[0] = (__force u32)saddr; hash[1] = (__force u32)daddr; hash[2] = ((__force u16)sport << 16) + (__force u16)dport; @@ -164,6 +184,7 @@ u64 secure_dccpv6_sequence_number(__be32 *saddr, __be32 *daddr, u64 seq; u32 i; + net_secret_init(); memcpy(hash, saddr, 16); for (i = 0; i < 4; i++) secret[i] = net_secret[i] + daddr[i]; diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index 7a1874b7b8fd..cfeb85cff4f0 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -263,10 +263,8 @@ void build_ehash_secret(void) get_random_bytes(&rnd, sizeof(rnd)); } while (rnd == 0); - if (cmpxchg(&inet_ehash_secret, 0, rnd) == 0) { + if (cmpxchg(&inet_ehash_secret, 0, rnd) == 0) get_random_bytes(&ipv6_hash_secret, sizeof(ipv6_hash_secret)); - net_secret_init(); - } } EXPORT_SYMBOL(build_ehash_secret); -- cgit v1.2.3-58-ga151 From f4a87e7bd2eaef26a3ca25437ce8b807de2966ad Mon Sep 17 00:00:00 2001 From: Patrick McHardy Date: Mon, 30 Sep 2013 08:51:46 +0100 Subject: netfilter: synproxy: fix BUG_ON triggered by corrupt TCP packets TCP packets hitting the SYN proxy through the SYNPROXY target are not validated by TCP conntrack. When th->doff is below 5, an underflow happens when calculating the options length, causing skb_header_pointer() to return NULL and triggering the BUG_ON(). Handle this case gracefully by checking for NULL instead of using BUG_ON(). Reported-by: Martin Topholm Tested-by: Martin Topholm Signed-off-by: Patrick McHardy Signed-off-by: Pablo Neira Ayuso --- include/net/netfilter/nf_conntrack_synproxy.h | 2 +- net/ipv4/netfilter/ipt_SYNPROXY.c | 10 +++++++--- net/ipv6/netfilter/ip6t_SYNPROXY.c | 10 +++++++--- net/netfilter/nf_synproxy_core.c | 12 +++++++----- 4 files changed, 22 insertions(+), 12 deletions(-) (limited to 'net') diff --git a/include/net/netfilter/nf_conntrack_synproxy.h b/include/net/netfilter/nf_conntrack_synproxy.h index 806f54a290d6..f572f313d6f1 100644 --- a/include/net/netfilter/nf_conntrack_synproxy.h +++ b/include/net/netfilter/nf_conntrack_synproxy.h @@ -56,7 +56,7 @@ struct synproxy_options { struct tcphdr; struct xt_synproxy_info; -extern void synproxy_parse_options(const struct sk_buff *skb, unsigned int doff, +extern bool synproxy_parse_options(const struct sk_buff *skb, unsigned int doff, const struct tcphdr *th, struct synproxy_options *opts); extern unsigned int synproxy_options_size(const struct synproxy_options *opts); diff --git a/net/ipv4/netfilter/ipt_SYNPROXY.c b/net/ipv4/netfilter/ipt_SYNPROXY.c index 67e17dcda65e..b6346bf2fde3 100644 --- a/net/ipv4/netfilter/ipt_SYNPROXY.c +++ b/net/ipv4/netfilter/ipt_SYNPROXY.c @@ -267,7 +267,8 @@ synproxy_tg4(struct sk_buff *skb, const struct xt_action_param *par) if (th == NULL) return NF_DROP; - synproxy_parse_options(skb, par->thoff, th, &opts); + if (!synproxy_parse_options(skb, par->thoff, th, &opts)) + return NF_DROP; if (th->syn && !(th->ack || th->fin || th->rst)) { /* Initial SYN from client */ @@ -350,7 +351,8 @@ static unsigned int ipv4_synproxy_hook(unsigned int hooknum, /* fall through */ case TCP_CONNTRACK_SYN_SENT: - synproxy_parse_options(skb, thoff, th, &opts); + if (!synproxy_parse_options(skb, thoff, th, &opts)) + return NF_DROP; if (!th->syn && th->ack && CTINFO2DIR(ctinfo) == IP_CT_DIR_ORIGINAL) { @@ -373,7 +375,9 @@ static unsigned int ipv4_synproxy_hook(unsigned int hooknum, if (!th->syn || !th->ack) break; - synproxy_parse_options(skb, thoff, th, &opts); + if (!synproxy_parse_options(skb, thoff, th, &opts)) + return NF_DROP; + if (opts.options & XT_SYNPROXY_OPT_TIMESTAMP) synproxy->tsoff = opts.tsval - synproxy->its; diff --git a/net/ipv6/netfilter/ip6t_SYNPROXY.c b/net/ipv6/netfilter/ip6t_SYNPROXY.c index 19cfea8dbcaa..2748b042da72 100644 --- a/net/ipv6/netfilter/ip6t_SYNPROXY.c +++ b/net/ipv6/netfilter/ip6t_SYNPROXY.c @@ -282,7 +282,8 @@ synproxy_tg6(struct sk_buff *skb, const struct xt_action_param *par) if (th == NULL) return NF_DROP; - synproxy_parse_options(skb, par->thoff, th, &opts); + if (!synproxy_parse_options(skb, par->thoff, th, &opts)) + return NF_DROP; if (th->syn && !(th->ack || th->fin || th->rst)) { /* Initial SYN from client */ @@ -372,7 +373,8 @@ static unsigned int ipv6_synproxy_hook(unsigned int hooknum, /* fall through */ case TCP_CONNTRACK_SYN_SENT: - synproxy_parse_options(skb, thoff, th, &opts); + if (!synproxy_parse_options(skb, thoff, th, &opts)) + return NF_DROP; if (!th->syn && th->ack && CTINFO2DIR(ctinfo) == IP_CT_DIR_ORIGINAL) { @@ -395,7 +397,9 @@ static unsigned int ipv6_synproxy_hook(unsigned int hooknum, if (!th->syn || !th->ack) break; - synproxy_parse_options(skb, thoff, th, &opts); + if (!synproxy_parse_options(skb, thoff, th, &opts)) + return NF_DROP; + if (opts.options & XT_SYNPROXY_OPT_TIMESTAMP) synproxy->tsoff = opts.tsval - synproxy->its; diff --git a/net/netfilter/nf_synproxy_core.c b/net/netfilter/nf_synproxy_core.c index 6fd967c6278c..cdf4567ba9b3 100644 --- a/net/netfilter/nf_synproxy_core.c +++ b/net/netfilter/nf_synproxy_core.c @@ -24,7 +24,7 @@ int synproxy_net_id; EXPORT_SYMBOL_GPL(synproxy_net_id); -void +bool synproxy_parse_options(const struct sk_buff *skb, unsigned int doff, const struct tcphdr *th, struct synproxy_options *opts) { @@ -32,7 +32,8 @@ synproxy_parse_options(const struct sk_buff *skb, unsigned int doff, u8 buf[40], *ptr; ptr = skb_header_pointer(skb, doff + sizeof(*th), length, buf); - BUG_ON(ptr == NULL); + if (ptr == NULL) + return false; opts->options = 0; while (length > 0) { @@ -41,16 +42,16 @@ synproxy_parse_options(const struct sk_buff *skb, unsigned int doff, switch (opcode) { case TCPOPT_EOL: - return; + return true; case TCPOPT_NOP: length--; continue; default: opsize = *ptr++; if (opsize < 2) - return; + return true; if (opsize > length) - return; + return true; switch (opcode) { case TCPOPT_MSS: @@ -84,6 +85,7 @@ synproxy_parse_options(const struct sk_buff *skb, unsigned int doff, length -= opsize; } } + return true; } EXPORT_SYMBOL_GPL(synproxy_parse_options); -- cgit v1.2.3-58-ga151 From d4a71b155c12d0d429c6b69d94076d6d57e2a7a7 Mon Sep 17 00:00:00 2001 From: Pravin B Shelar Date: Wed, 25 Sep 2013 09:57:47 -0700 Subject: ip_tunnel: Do not use stale inner_iph pointer. While sending packet skb_cow_head() can change skb header which invalidates inner_iph pointer to skb header. Following patch avoid using it. Found by code inspection. This bug was introduced by commit 0e6fbc5b6c6218 (ip_tunnels: extend iptunnel_xmit()). Signed-off-by: Pravin B Shelar Acked-by: Eric Dumazet Signed-off-by: David S. Miller --- net/ipv4/ip_tunnel.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'net') diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c index ac9fabe0300f..d3fbad422e0e 100644 --- a/net/ipv4/ip_tunnel.c +++ b/net/ipv4/ip_tunnel.c @@ -623,6 +623,7 @@ void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev, tunnel->err_count = 0; } + tos = ip_tunnel_ecn_encap(tos, inner_iph, skb); ttl = tnl_params->ttl; if (ttl == 0) { if (skb->protocol == htons(ETH_P_IP)) @@ -651,8 +652,7 @@ void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev, } err = iptunnel_xmit(rt, skb, fl4.saddr, fl4.daddr, protocol, - ip_tunnel_ecn_encap(tos, inner_iph, skb), ttl, df, - !net_eq(tunnel->net, dev_net(dev))); + tos, ttl, df, !net_eq(tunnel->net, dev_net(dev))); iptunnel_xmit_stats(err, &dev->stats, dev->tstats); return; -- cgit v1.2.3-58-ga151 From c9d55d5bff05084b5829f751aebd03d0c8f632f5 Mon Sep 17 00:00:00 2001 From: Paul Marks Date: Wed, 25 Sep 2013 15:12:55 -0700 Subject: ipv6: Fix preferred_lft not updating in some cases Consider the scenario where an IPv6 router is advertising a fixed preferred_lft of 1800 seconds, while the valid_lft begins at 3600 seconds and counts down in realtime. A client should reset its preferred_lft to 1800 every time the RA is received, but a bug is causing Linux to ignore the update. The core problem is here: if (prefered_lft != ifp->prefered_lft) { Note that ifp->prefered_lft is an offset, so it doesn't decrease over time. Thus, the comparison is always (1800 != 1800), which fails to trigger an update. The most direct solution would be to compute a "stored_prefered_lft", and use that value in the comparison. But I think that trying to filter out unnecessary updates here is a premature optimization. In order for the filter to apply, both of these would need to hold: - The advertised valid_lft and preferred_lft are both declining in real time. - No clock skew exists between the router & client. So in this patch, I've set "update_lft = 1" unconditionally, which allows the surrounding code to be greatly simplified. Signed-off-by: Paul Marks Acked-by: Hannes Frederic Sowa Signed-off-by: David S. Miller --- net/ipv6/addrconf.c | 52 +++++++++++++++------------------------------------- 1 file changed, 15 insertions(+), 37 deletions(-) (limited to 'net') diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index a0c3abe72461..cd3fb301da38 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -2220,43 +2220,21 @@ ok: else stored_lft = 0; if (!update_lft && !create && stored_lft) { - if (valid_lft > MIN_VALID_LIFETIME || - valid_lft > stored_lft) - update_lft = 1; - else if (stored_lft <= MIN_VALID_LIFETIME) { - /* valid_lft <= stored_lft is always true */ - /* - * RFC 4862 Section 5.5.3e: - * "Note that the preferred lifetime of - * the corresponding address is always - * reset to the Preferred Lifetime in - * the received Prefix Information - * option, regardless of whether the - * valid lifetime is also reset or - * ignored." - * - * So if the preferred lifetime in - * this advertisement is different - * than what we have stored, but the - * valid lifetime is invalid, just - * reset prefered_lft. - * - * We must set the valid lifetime - * to the stored lifetime since we'll - * be updating the timestamp below, - * else we'll set it back to the - * minimum. - */ - if (prefered_lft != ifp->prefered_lft) { - valid_lft = stored_lft; - update_lft = 1; - } - } else { - valid_lft = MIN_VALID_LIFETIME; - if (valid_lft < prefered_lft) - prefered_lft = valid_lft; - update_lft = 1; - } + const u32 minimum_lft = min( + stored_lft, (u32)MIN_VALID_LIFETIME); + valid_lft = max(valid_lft, minimum_lft); + + /* RFC4862 Section 5.5.3e: + * "Note that the preferred lifetime of the + * corresponding address is always reset to + * the Preferred Lifetime in the received + * Prefix Information option, regardless of + * whether the valid lifetime is also reset or + * ignored." + * + * So we should always update prefered_lft here. + */ + update_lft = 1; } if (update_lft) { -- cgit v1.2.3-58-ga151 From b86783587b3d1d552326d955acee37eac48800f1 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Thu, 26 Sep 2013 08:44:06 -0700 Subject: net: flow_dissector: fix thoff for IPPROTO_AH In commit 8ed781668dd49 ("flow_keys: include thoff into flow_keys for later usage"), we missed that existing code was using nhoff as a temporary variable that could not always contain transport header offset. This is not a problem for TCP/UDP because port offset (@poff) is 0 for these protocols. Signed-off-by: Eric Dumazet Cc: Daniel Borkmann Cc: Nikolay Aleksandrov Acked-by: Nikolay Aleksandrov Acked-by: Daniel Borkmann Signed-off-by: David S. Miller --- net/core/flow_dissector.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'net') diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c index 1929af87b260..8d7d0dd72db2 100644 --- a/net/core/flow_dissector.c +++ b/net/core/flow_dissector.c @@ -154,8 +154,8 @@ ipv6: if (poff >= 0) { __be32 *ports, _ports; - nhoff += poff; - ports = skb_header_pointer(skb, nhoff, sizeof(_ports), &_ports); + ports = skb_header_pointer(skb, nhoff + poff, + sizeof(_ports), &_ports); if (ports) flow->ports = *ports; } -- cgit v1.2.3-58-ga151 From 8d34ce10c59b40e0ce2685341c4e93416f505e45 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Fri, 27 Sep 2013 14:20:01 -0700 Subject: pkt_sched: fq: qdisc dismantle fixes fq_reset() should drops all packets in queue, including throttled flows. This patch moves code from fq_destroy() to fq_reset() to do the cleaning. fq_change() must stop calling fq_dequeue() if all remaining packets are from throttled flows. Signed-off-by: Eric Dumazet Signed-off-by: David S. Miller --- net/sched/sch_fq.c | 57 +++++++++++++++++++++++++++++++++++------------------- 1 file changed, 37 insertions(+), 20 deletions(-) (limited to 'net') diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c index 32ad015ee8ce..fc6de56a331e 100644 --- a/net/sched/sch_fq.c +++ b/net/sched/sch_fq.c @@ -285,7 +285,7 @@ static struct fq_flow *fq_classify(struct sk_buff *skb, struct fq_sched_data *q) /* remove one skb from head of flow queue */ -static struct sk_buff *fq_dequeue_head(struct fq_flow *flow) +static struct sk_buff *fq_dequeue_head(struct Qdisc *sch, struct fq_flow *flow) { struct sk_buff *skb = flow->head; @@ -293,6 +293,8 @@ static struct sk_buff *fq_dequeue_head(struct fq_flow *flow) flow->head = skb->next; skb->next = NULL; flow->qlen--; + sch->qstats.backlog -= qdisc_pkt_len(skb); + sch->q.qlen--; } return skb; } @@ -419,7 +421,7 @@ static struct sk_buff *fq_dequeue(struct Qdisc *sch) struct sk_buff *skb; struct fq_flow *f; - skb = fq_dequeue_head(&q->internal); + skb = fq_dequeue_head(sch, &q->internal); if (skb) goto out; fq_check_throttled(q, now); @@ -449,7 +451,7 @@ begin: goto begin; } - skb = fq_dequeue_head(f); + skb = fq_dequeue_head(sch, f); if (!skb) { head->first = f->next; /* force a pass through old_flows to prevent starvation */ @@ -490,19 +492,44 @@ begin: } } out: - sch->qstats.backlog -= qdisc_pkt_len(skb); qdisc_bstats_update(sch, skb); - sch->q.qlen--; qdisc_unthrottled(sch); return skb; } static void fq_reset(struct Qdisc *sch) { + struct fq_sched_data *q = qdisc_priv(sch); + struct rb_root *root; struct sk_buff *skb; + struct rb_node *p; + struct fq_flow *f; + unsigned int idx; - while ((skb = fq_dequeue(sch)) != NULL) + while ((skb = fq_dequeue_head(sch, &q->internal)) != NULL) kfree_skb(skb); + + if (!q->fq_root) + return; + + for (idx = 0; idx < (1U << q->fq_trees_log); idx++) { + root = &q->fq_root[idx]; + while ((p = rb_first(root)) != NULL) { + f = container_of(p, struct fq_flow, fq_node); + rb_erase(p, root); + + while ((skb = fq_dequeue_head(sch, f)) != NULL) + kfree_skb(skb); + + kmem_cache_free(fq_flow_cachep, f); + } + } + q->new_flows.first = NULL; + q->old_flows.first = NULL; + q->delayed = RB_ROOT; + q->flows = 0; + q->inactive_flows = 0; + q->throttled_flows = 0; } static void fq_rehash(struct fq_sched_data *q, @@ -645,6 +672,8 @@ static int fq_change(struct Qdisc *sch, struct nlattr *opt) while (sch->q.qlen > sch->limit) { struct sk_buff *skb = fq_dequeue(sch); + if (!skb) + break; kfree_skb(skb); drop_count++; } @@ -657,21 +686,9 @@ static int fq_change(struct Qdisc *sch, struct nlattr *opt) static void fq_destroy(struct Qdisc *sch) { struct fq_sched_data *q = qdisc_priv(sch); - struct rb_root *root; - struct rb_node *p; - unsigned int idx; - if (q->fq_root) { - for (idx = 0; idx < (1U << q->fq_trees_log); idx++) { - root = &q->fq_root[idx]; - while ((p = rb_first(root)) != NULL) { - rb_erase(p, root); - kmem_cache_free(fq_flow_cachep, - container_of(p, struct fq_flow, fq_node)); - } - } - kfree(q->fq_root); - } + fq_reset(sch); + kfree(q->fq_root); qdisc_watchdog_cancel(&q->watchdog); } -- cgit v1.2.3-58-ga151 From c9eeec26e32e087359160406f96e0949b3cc6f10 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Fri, 27 Sep 2013 03:28:54 -0700 Subject: tcp: TSQ can use a dynamic limit When TCP Small Queues was added, we used a sysctl to limit amount of packets queues on Qdisc/device queues for a given TCP flow. Problem is this limit is either too big for low rates, or too small for high rates. Now TCP stack has rate estimation in sk->sk_pacing_rate, and TSO auto sizing, it can better control number of packets in Qdisc/device queues. New limit is two packets or at least 1 to 2 ms worth of packets. Low rates flows benefit from this patch by having even smaller number of packets in queues, allowing for faster recovery, better RTT estimations. High rates flows benefit from this patch by allowing more than 2 packets in flight as we had reports this was a limiting factor to reach line rate. [ In particular if TX completion is delayed because of coalescing parameters ] Example for a single flow on 10Gbp link controlled by FQ/pacing 14 packets in flight instead of 2 $ tc -s -d qd qdisc fq 8001: dev eth0 root refcnt 32 limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 Sent 1168459366606 bytes 771822841 pkt (dropped 0, overlimits 0 requeues 6822476) rate 9346Mbit 771713pps backlog 953820b 14p requeues 6822476 2047 flow, 2046 inactive, 1 throttled, delay 15673 ns 2372 gc, 0 highprio, 0 retrans, 9739249 throttled, 0 flows_plimit Note that sk_pacing_rate is currently set to twice the actual rate, but this might be refined in the future when a flow is in congestion avoidance. Additional change : skb->destructor should be set to tcp_wfree(). A future patch (for linux 3.13+) might remove tcp_limit_output_bytes Signed-off-by: Eric Dumazet Cc: Wei Liu Cc: Cong Wang Cc: Yuchung Cheng Cc: Neal Cardwell Acked-by: Neal Cardwell Signed-off-by: David S. Miller --- net/ipv4/tcp_output.c | 17 +++++++++++------ 1 file changed, 11 insertions(+), 6 deletions(-) (limited to 'net') diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 7c83cb8bf137..e6bb8256e59f 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -895,8 +895,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it, skb_orphan(skb); skb->sk = sk; - skb->destructor = (sysctl_tcp_limit_output_bytes > 0) ? - tcp_wfree : sock_wfree; + skb->destructor = tcp_wfree; atomic_add(skb->truesize, &sk->sk_wmem_alloc); /* Build TCP header and checksum it. */ @@ -1840,7 +1839,6 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle, while ((skb = tcp_send_head(sk))) { unsigned int limit; - tso_segs = tcp_init_tso_segs(sk, skb, mss_now); BUG_ON(!tso_segs); @@ -1869,13 +1867,20 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle, break; } - /* TSQ : sk_wmem_alloc accounts skb truesize, - * including skb overhead. But thats OK. + /* TCP Small Queues : + * Control number of packets in qdisc/devices to two packets / or ~1 ms. + * This allows for : + * - better RTT estimation and ACK scheduling + * - faster recovery + * - high rates */ - if (atomic_read(&sk->sk_wmem_alloc) >= sysctl_tcp_limit_output_bytes) { + limit = max(skb->truesize, sk->sk_pacing_rate >> 10); + + if (atomic_read(&sk->sk_wmem_alloc) > limit) { set_bit(TSQ_THROTTLED, &tp->tsq_flags); break; } + limit = mss_now; if (tso_segs > 1 && !tcp_urg_mode(tp)) limit = tcp_mss_split_point(sk, skb, mss_now, -- cgit v1.2.3-58-ga151 From 3da812d860755925da890e8c713f2d2e2d7b1bae Mon Sep 17 00:00:00 2001 From: Hannes Frederic Sowa Date: Sun, 29 Sep 2013 05:40:50 +0200 Subject: ipv6: gre: correct calculation of max_headroom gre_hlen already accounts for sizeof(struct ipv6_hdr) + gre header, so initialize max_headroom to zero. Otherwise the if (encap_limit >= 0) { max_headroom += 8; mtu -= 8; } increments an uninitialized variable before max_headroom was reset. Found with coverity: 728539 Cc: Dmitry Kozlov Signed-off-by: Hannes Frederic Sowa Acked-by: Eric Dumazet Signed-off-by: David S. Miller --- net/ipv6/ip6_gre.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'net') diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c index 6b26e9feafb9..7bb5446b9d73 100644 --- a/net/ipv6/ip6_gre.c +++ b/net/ipv6/ip6_gre.c @@ -618,7 +618,7 @@ static netdev_tx_t ip6gre_xmit2(struct sk_buff *skb, struct ip6_tnl *tunnel = netdev_priv(dev); struct net_device *tdev; /* Device to other host */ struct ipv6hdr *ipv6h; /* Our new IP header */ - unsigned int max_headroom; /* The extra header space needed */ + unsigned int max_headroom = 0; /* The extra header space needed */ int gre_hlen; struct ipv6_tel_txoption opt; int mtu; @@ -693,7 +693,7 @@ static netdev_tx_t ip6gre_xmit2(struct sk_buff *skb, skb_scrub_packet(skb, !net_eq(tunnel->net, dev_net(dev))); - max_headroom = LL_RESERVED_SPACE(tdev) + gre_hlen + dst->header_len; + max_headroom += LL_RESERVED_SPACE(tdev) + gre_hlen + dst->header_len; if (skb_headroom(skb) < max_headroom || skb_shared(skb) || (skb_cloned(skb) && !skb_clone_writable(skb, 0))) { -- cgit v1.2.3-58-ga151 From e2401654dd0f5f3fb7a8d80dad9554d73d7ca394 Mon Sep 17 00:00:00 2001 From: Salam Noureddine Date: Sun, 29 Sep 2013 13:39:42 -0700 Subject: ipv4 igmp: use in_dev_put in timer handlers instead of __in_dev_put It is possible for the timer handlers to run after the call to ip_mc_down so use in_dev_put instead of __in_dev_put in the handler function in order to do proper cleanup when the refcnt reaches 0. Otherwise, the refcnt can reach zero without the in_device being destroyed and we end up leaking a reference to the net_device and see messages like the following, unregister_netdevice: waiting for eth0 to become free. Usage count = 1 Tested on linux-3.4.43. Signed-off-by: Salam Noureddine Signed-off-by: David S. Miller --- net/ipv4/igmp.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'net') diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c index dace87f06e5f..7defdc9ba167 100644 --- a/net/ipv4/igmp.c +++ b/net/ipv4/igmp.c @@ -736,7 +736,7 @@ static void igmp_gq_timer_expire(unsigned long data) in_dev->mr_gq_running = 0; igmpv3_send_report(in_dev, NULL); - __in_dev_put(in_dev); + in_dev_put(in_dev); } static void igmp_ifc_timer_expire(unsigned long data) @@ -749,7 +749,7 @@ static void igmp_ifc_timer_expire(unsigned long data) igmp_ifc_start_timer(in_dev, unsolicited_report_interval(in_dev)); } - __in_dev_put(in_dev); + in_dev_put(in_dev); } static void igmp_ifc_event(struct in_device *in_dev) -- cgit v1.2.3-58-ga151 From 9260d3e1013701aa814d10c8fc6a9f92bd17d643 Mon Sep 17 00:00:00 2001 From: Salam Noureddine Date: Sun, 29 Sep 2013 13:41:34 -0700 Subject: ipv6 mcast: use in6_dev_put in timer handlers instead of __in6_dev_put It is possible for the timer handlers to run after the call to ipv6_mc_down so use in6_dev_put instead of __in6_dev_put in the handler function in order to do proper cleanup when the refcnt reaches 0. Otherwise, the refcnt can reach zero without the inet6_dev being destroyed and we end up leaking a reference to the net_device and see messages like the following, unregister_netdevice: waiting for eth0 to become free. Usage count = 1 Tested on linux-3.4.43. Signed-off-by: Salam Noureddine Signed-off-by: David S. Miller --- net/ipv6/mcast.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'net') diff --git a/net/ipv6/mcast.c b/net/ipv6/mcast.c index 096cd67b737c..d18f9f903db6 100644 --- a/net/ipv6/mcast.c +++ b/net/ipv6/mcast.c @@ -2034,7 +2034,7 @@ static void mld_dad_timer_expire(unsigned long data) if (idev->mc_dad_count) mld_dad_start_timer(idev, idev->mc_maxdelay); } - __in6_dev_put(idev); + in6_dev_put(idev); } static int ip6_mc_del1_src(struct ifmcaddr6 *pmc, int sfmode, @@ -2379,7 +2379,7 @@ static void mld_gq_timer_expire(unsigned long data) idev->mc_gq_running = 0; mld_send_report(idev, NULL); - __in6_dev_put(idev); + in6_dev_put(idev); } static void mld_ifc_timer_expire(unsigned long data) @@ -2392,7 +2392,7 @@ static void mld_ifc_timer_expire(unsigned long data) if (idev->mc_ifc_count) mld_ifc_start_timer(idev, idev->mc_maxdelay); } - __in6_dev_put(idev); + in6_dev_put(idev); } static void mld_ifc_event(struct inet6_dev *idev) -- cgit v1.2.3-58-ga151 From 3e08f4a72f689c6296d336c2aab4bddd60c93ae2 Mon Sep 17 00:00:00 2001 From: Steffen Klassert Date: Tue, 1 Oct 2013 11:33:59 +0200 Subject: ip_tunnel: Fix a memory corruption in ip_tunnel_xmit We might extend the used aera of a skb beyond the total headroom when we install the ipip header. Fix this by calling skb_cow_head() unconditionally. Bug was introduced with commit c544193214 ("GRE: Refactor GRE tunneling code.") Cc: Pravin Shelar Signed-off-by: Steffen Klassert Signed-off-by: David S. Miller --- net/ipv4/ip_tunnel.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) (limited to 'net') diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c index d3fbad422e0e..dfc6d8a3caa7 100644 --- a/net/ipv4/ip_tunnel.c +++ b/net/ipv4/ip_tunnel.c @@ -642,13 +642,13 @@ void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev, max_headroom = LL_RESERVED_SPACE(rt->dst.dev) + sizeof(struct iphdr) + rt->dst.header_len; - if (max_headroom > dev->needed_headroom) { + if (max_headroom > dev->needed_headroom) dev->needed_headroom = max_headroom; - if (skb_cow_head(skb, dev->needed_headroom)) { - dev->stats.tx_dropped++; - dev_kfree_skb(skb); - return; - } + + if (skb_cow_head(skb, dev->needed_headroom)) { + dev->stats.tx_dropped++; + dev_kfree_skb(skb); + return; } err = iptunnel_xmit(rt, skb, fl4.saddr, fl4.daddr, protocol, -- cgit v1.2.3-58-ga151 From 67013282627185aeec2fb92c75868dcace0d25b4 Mon Sep 17 00:00:00 2001 From: Steffen Klassert Date: Tue, 1 Oct 2013 11:34:48 +0200 Subject: ip_tunnel: Add fallback tunnels to the hash lists Currently we can not update the tunnel parameters of the fallback tunnels because we don't find them in the hash lists. Fix this by adding them on initialization. Bug was introduced with commit c544193214 ("GRE: Refactor GRE tunneling code.") Cc: Pravin Shelar Signed-off-by: Steffen Klassert Signed-off-by: David S. Miller --- net/ipv4/ip_tunnel.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) (limited to 'net') diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c index dfc6d8a3caa7..895a3535321b 100644 --- a/net/ipv4/ip_tunnel.c +++ b/net/ipv4/ip_tunnel.c @@ -853,8 +853,10 @@ int ip_tunnel_init_net(struct net *net, int ip_tnl_net_id, /* FB netdevice is special: we have one, and only one per netns. * Allowing to move it to another netns is clearly unsafe. */ - if (!IS_ERR(itn->fb_tunnel_dev)) + if (!IS_ERR(itn->fb_tunnel_dev)) { itn->fb_tunnel_dev->features |= NETIF_F_NETNS_LOCAL; + ip_tunnel_add(itn, netdev_priv(itn->fb_tunnel_dev)); + } rtnl_unlock(); return PTR_RET(itn->fb_tunnel_dev); -- cgit v1.2.3-58-ga151 From 78a3694d44a029242dd0830b34ab20ef1704be35 Mon Sep 17 00:00:00 2001 From: Steffen Klassert Date: Tue, 1 Oct 2013 11:35:51 +0200 Subject: ip_tunnel_core: Change __skb_push back to skb_push Git commit 0e6fbc5b ("ip_tunnels: extend iptunnel_xmit()") moved the IP header installation to iptunnel_xmit() and changed skb_push() to __skb_push(). This makes possible bugs hard to track down, so change it back to skb_push(). Cc: Pravin Shelar Signed-off-by: Steffen Klassert Signed-off-by: David S. Miller --- net/ipv4/ip_tunnel_core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'net') diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c index d6c856b17fd4..c31e3ad98ef2 100644 --- a/net/ipv4/ip_tunnel_core.c +++ b/net/ipv4/ip_tunnel_core.c @@ -61,7 +61,7 @@ int iptunnel_xmit(struct rtable *rt, struct sk_buff *skb, memset(IPCB(skb), 0, sizeof(*IPCB(skb))); /* Push down and install the IP header. */ - __skb_push(skb, sizeof(struct iphdr)); + skb_push(skb, sizeof(struct iphdr)); skb_reset_network_header(skb); iph = ip_hdr(skb); -- cgit v1.2.3-58-ga151 From cfe4a536927c3186b7e7f9be688e7e0f62bb8ea1 Mon Sep 17 00:00:00 2001 From: Steffen Klassert Date: Tue, 1 Oct 2013 11:37:37 +0200 Subject: ip_tunnel: Remove double unregister of the fallback device When queueing the netdevices for removal, we queue the fallback device twice in ip_tunnel_destroy(). The first time when we queue all netdevices in the namespace and then again explicitly. Fix this by removing the explicit queueing of the fallback device. Bug was introduced when network namespace support was added with commit 6c742e714d8 ("ipip: add x-netns support"). Cc: Nicolas Dichtel Signed-off-by: Steffen Klassert Acked-by: Nicolas Dichtel Signed-off-by: David S. Miller --- net/ipv4/ip_tunnel.c | 2 -- 1 file changed, 2 deletions(-) (limited to 'net') diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c index 895a3535321b..63a6d6d6b875 100644 --- a/net/ipv4/ip_tunnel.c +++ b/net/ipv4/ip_tunnel.c @@ -886,8 +886,6 @@ static void ip_tunnel_destroy(struct ip_tunnel_net *itn, struct list_head *head, if (!net_eq(dev_net(t->dev), net)) unregister_netdevice_queue(t->dev, head); } - if (itn->fb_tunnel_dev) - unregister_netdevice_queue(itn->fb_tunnel_dev, head); } void ip_tunnel_delete_net(struct ip_tunnel_net *itn, struct rtnl_link_ops *ops) -- cgit v1.2.3-58-ga151 From 205983c43700ac3a81e7625273a3fa83cd2759b5 Mon Sep 17 00:00:00 2001 From: Nicolas Dichtel Date: Tue, 1 Oct 2013 18:04:59 +0200 Subject: sit: allow to use rtnl ops on fb tunnel rtnl ops where introduced by ba3e3f50a0e5 ("sit: advertise tunnel param via rtnl"), but I forget to assign rtnl ops to fb tunnels. Now that it is done, we must remove the explicit call to unregister_netdevice_queue(), because the fallback tunnel is added to the queue in sit_destroy_tunnels() when checking rtnl_link_ops of all netdevices (this is valid since commit 5e6700b3bf98 ("sit: add support of x-netns")). Signed-off-by: Nicolas Dichtel Signed-off-by: David S. Miller --- net/ipv6/sit.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'net') diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c index afd5605aea7c..19269453a8ea 100644 --- a/net/ipv6/sit.c +++ b/net/ipv6/sit.c @@ -1666,6 +1666,7 @@ static int __net_init sit_init_net(struct net *net) goto err_alloc_dev; } dev_net_set(sitn->fb_tunnel_dev, net); + sitn->fb_tunnel_dev->rtnl_link_ops = &sit_link_ops; /* FB netdevice is special: we have one, and only one per netns. * Allowing to move it to another netns is clearly unsafe. */ @@ -1700,7 +1701,6 @@ static void __net_exit sit_exit_net(struct net *net) rtnl_lock(); sit_destroy_tunnels(sitn, &list); - unregister_netdevice_queue(sitn->fb_tunnel_dev, &list); unregister_netdevice_many(&list); rtnl_unlock(); } -- cgit v1.2.3-58-ga151 From bb8140947a247b9aa15652cc24dc555ebb0b64b0 Mon Sep 17 00:00:00 2001 From: Nicolas Dichtel Date: Tue, 1 Oct 2013 18:05:00 +0200 Subject: ip6tnl: allow to use rtnl ops on fb tunnel rtnl ops where introduced by c075b13098b3 ("ip6tnl: advertise tunnel param via rtnl"), but I forget to assign rtnl ops to fb tunnels. Now that it is done, we must remove the explicit call to unregister_netdevice_queue(), because the fallback tunnel is added to the queue in ip6_tnl_destroy_tunnels() when checking rtnl_link_ops of all netdevices (this is valid since commit 0bd8762824e7 ("ip6tnl: add x-netns support")). Signed-off-by: Nicolas Dichtel Signed-off-by: David S. Miller --- net/ipv6/ip6_tunnel.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'net') diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c index 2d8f4829575b..a791552e0422 100644 --- a/net/ipv6/ip6_tunnel.c +++ b/net/ipv6/ip6_tunnel.c @@ -1731,8 +1731,6 @@ static void __net_exit ip6_tnl_destroy_tunnels(struct ip6_tnl_net *ip6n) } } - t = rtnl_dereference(ip6n->tnls_wc[0]); - unregister_netdevice_queue(t->dev, &list); unregister_netdevice_many(&list); } @@ -1752,6 +1750,7 @@ static int __net_init ip6_tnl_init_net(struct net *net) if (!ip6n->fb_tnl_dev) goto err_alloc_dev; dev_net_set(ip6n->fb_tnl_dev, net); + ip6n->fb_tnl_dev->rtnl_link_ops = &ip6_link_ops; /* FB netdevice is special: we have one, and only one per netns. * Allowing to move it to another netns is clearly unsafe. */ -- cgit v1.2.3-58-ga151 From 0eab5eb7a3a9a6ccfcdecbffff00d60a86a004bb Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Tue, 1 Oct 2013 09:10:16 -0700 Subject: pkt_sched: fq: rate limiting improvements FQ rate limiting suffers from two problems, reported by Steinar : 1) FQ enforces a delay when flow quantum is exhausted in order to reduce cpu overhead. But if packets are small, current delay computation is slightly wrong, and observed rates can be too high. Steinar had this problem because he disabled TSO and GSO, and default FQ quantum is 2*1514. (Of course, I wish recent TSO auto sizing changes will help to not having to disable TSO in the first place) 2) maxrate was not used for forwarded flows (skbs not attached to a socket) Tested: tc qdisc add dev eth0 root est 1sec 4sec fq maxrate 8Mbit netperf -H lpq84 -l 1000 & sleep 10 ; tc -s qdisc show dev eth0 qdisc fq 8003: root refcnt 32 limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 maxrate 8000Kbit Sent 16819357 bytes 11258 pkt (dropped 0, overlimits 0 requeues 0) rate 7831Kbit 653pps backlog 7570b 5p requeues 0 44 flows (43 inactive, 1 throttled), next packet delay 2977352 ns 0 gc, 0 highprio, 5545 throttled lpq83:~# tcpdump -p -i eth0 host lpq84 -c 12 09:02:52.079484 IP lpq83 > lpq84: . 1389536928:1389538376(1448) ack 3808678021 win 457 09:02:52.079499 IP lpq83 > lpq84: . 1448:2896(1448) ack 1 win 457 09:02:52.079906 IP lpq84 > lpq83: . ack 2896 win 16384 09:02:52.082568 IP lpq83 > lpq84: . 2896:4344(1448) ack 1 win 457 09:02:52.082581 IP lpq83 > lpq84: . 4344:5792(1448) ack 1 win 457 09:02:52.083017 IP lpq84 > lpq83: . ack 5792 win 16384 09:02:52.085678 IP lpq83 > lpq84: . 5792:7240(1448) ack 1 win 457 09:02:52.085693 IP lpq83 > lpq84: . 7240:8688(1448) ack 1 win 457 09:02:52.086117 IP lpq84 > lpq83: . ack 8688 win 16384 09:02:52.088792 IP lpq83 > lpq84: . 8688:10136(1448) ack 1 win 457 09:02:52.088806 IP lpq83 > lpq84: . 10136:11584(1448) ack 1 win 457 09:02:52.089217 IP lpq84 > lpq83: . ack 11584 win 16384 Reported-by: Steinar H. Gunderson Signed-off-by: Eric Dumazet Signed-off-by: David S. Miller --- net/sched/sch_fq.c | 45 ++++++++++++++++++++++++++------------------- 1 file changed, 26 insertions(+), 19 deletions(-) (limited to 'net') diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c index fc6de56a331e..a2fef8b10b96 100644 --- a/net/sched/sch_fq.c +++ b/net/sched/sch_fq.c @@ -420,6 +420,7 @@ static struct sk_buff *fq_dequeue(struct Qdisc *sch) struct fq_flow_head *head; struct sk_buff *skb; struct fq_flow *f; + u32 rate; skb = fq_dequeue_head(sch, &q->internal); if (skb) @@ -468,28 +469,34 @@ begin: f->time_next_packet = now; f->credit -= qdisc_pkt_len(skb); - if (f->credit <= 0 && - q->rate_enable && - skb->sk && skb->sk->sk_state != TCP_TIME_WAIT) { - u32 rate = skb->sk->sk_pacing_rate ?: q->flow_default_rate; + if (f->credit > 0 || !q->rate_enable) + goto out; - rate = min(rate, q->flow_max_rate); - if (rate) { - u64 len = (u64)qdisc_pkt_len(skb) * NSEC_PER_SEC; - - do_div(len, rate); - /* Since socket rate can change later, - * clamp the delay to 125 ms. - * TODO: maybe segment the too big skb, as in commit - * e43ac79a4bc ("sch_tbf: segment too big GSO packets") - */ - if (unlikely(len > 125 * NSEC_PER_MSEC)) { - len = 125 * NSEC_PER_MSEC; - q->stat_pkts_too_long++; - } + if (skb->sk && skb->sk->sk_state != TCP_TIME_WAIT) { + rate = skb->sk->sk_pacing_rate ?: q->flow_default_rate; - f->time_next_packet = now + len; + rate = min(rate, q->flow_max_rate); + } else { + rate = q->flow_max_rate; + if (rate == ~0U) + goto out; + } + if (rate) { + u32 plen = max(qdisc_pkt_len(skb), q->quantum); + u64 len = (u64)plen * NSEC_PER_SEC; + + do_div(len, rate); + /* Since socket rate can change later, + * clamp the delay to 125 ms. + * TODO: maybe segment the too big skb, as in commit + * e43ac79a4bc ("sch_tbf: segment too big GSO packets") + */ + if (unlikely(len > 125 * NSEC_PER_MSEC)) { + len = 125 * NSEC_PER_MSEC; + q->stat_pkts_too_long++; } + + f->time_next_packet = now + len; } out: qdisc_bstats_update(sch, skb); -- cgit v1.2.3-58-ga151 From 2433c8f094a008895e66f25bd1773cdb01c91d01 Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Sat, 5 Oct 2013 13:15:30 -0700 Subject: net: Update the sysctl permissions handler to test effective uid/gid Modify the code to use current_euid(), and in_egroup_p, as in done in fs/proc/proc_sysctl.c:test_perm() Cc: stable@vger.kernel.org Reviewed-by: Eric Sandeen Reported-by: Eric Sandeen Signed-off-by: "Eric W. Biederman" Signed-off-by: Linus Torvalds --- net/sysctl_net.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'net') diff --git a/net/sysctl_net.c b/net/sysctl_net.c index 9bc6db04be3e..e7000be321b0 100644 --- a/net/sysctl_net.c +++ b/net/sysctl_net.c @@ -47,12 +47,12 @@ static int net_ctl_permissions(struct ctl_table_header *head, /* Allow network administrator to have same access as root. */ if (ns_capable(net->user_ns, CAP_NET_ADMIN) || - uid_eq(root_uid, current_uid())) { + uid_eq(root_uid, current_euid())) { int mode = (table->mode >> 6) & 7; return (mode << 6) | (mode << 3) | mode; } /* Allow netns root group to have the same access as the root group */ - if (gid_eq(root_gid, current_gid())) { + if (in_egroup_p(root_gid)) { int mode = (table->mode >> 3) & 7; return (mode << 3) | mode; } -- cgit v1.2.3-58-ga151