The occasional ECONNRESET

(movq.de)

95 points | by zdw 9 hours ago

7 comments

  • smarks 7 hours ago
    Part 2 shows this comment from the Linux TCP code:

        /* As outlined in RFC 2525, section 2.17, we send a RST here because
         * data was lost. To witness the awful effects of the old behavior of
         * always doing a FIN, run an older 2.1.x kernel or 2.0.x, start a bulk
         * GET in an FTP client, suspend the process, wait for the client to
         * advertise a zero window, then kill -9 the FTP client, wheee...
         * Note: timeout is always zero in such a case.
         */
    
    Ok, so the RST is explained and well justified by the literature. But what are the “awful effects” of sending FIN instead? Can someone explain?
    • jmalicki 5 hours ago
      The difference between RST and FIN shows up in read() as an ECONNRESET error vs. an end-of-file (read() returning 0).

      In some protocols, end-of-file carries the semantic meaning that all data has been transferred, and TCP is set up so that you should be able to rely on that. If you can't rely on that difference, there is a bug in a TCP stack along the way.

      FIN also has a sequence number, so you can wait to ACK it until you get the corresponding data if it is dropped or out of order.

      A TCP RST says the other side won't resend anything that wasn't ACKed; the connection is reset. Further, once an RST has been received, the downloading client usually cannot even read the packets already sitting in its receive window, which might be hundreds of KB of missed data.

      RST and FIN are semantically very different.

      Reading the post, if gunicorn is e.g. sending a 404 after seeing a POST to a path it doesn't know about before reading the body, the client will never get the 404 because gunicorn hasn't read the message body.

      This case is partly why "Expect: 100-continue" exists, so it will be properly handled, even if it does introduce an extra round-trip lag in the POST.

      It might be dangerous to have your protocol rely on a piece of TCP that is often incorrectly implemented.
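
      The read()-level difference can be sketched on loopback. This is a minimal sketch, assuming Linux-like TCP behavior; the function name `observe_close` and the payload are made up for illustration, and only the standard `socket` and `select` modules are used.

```python
import select
import socket


def observe_close(client_sends_first: bool) -> str:
    """Report what a client sees when the server closes without reading.

    Returns "eof" if recv() returned b"" (the peer sent FIN), "reset" if
    recv() raised ConnectionResetError (the peer sent RST), else "data".
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        listener.listen(1)
        client = socket.create_connection(listener.getsockname())
        server, _ = listener.accept()

    with client:
        if client_sends_first:
            client.sendall(b"request body the server never reads")
            # Wait until the bytes are queued on the server side, so that
            # close() really happens with unread data pending.
            select.select([server], [], [], 5.0)
        server.close()  # the server never calls recv()

        client.settimeout(5.0)
        try:
            data = client.recv(4096)
        except ConnectionResetError:
            return "reset"
        return "eof" if data == b"" else "data"
```

      Calling it with `client_sends_first=False` exercises the clean FIN path, and with `True` the close-with-unread-data path that the kernel comment describes.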

    • buckle8017 7 hours ago
      Client sees a clean disconnect and I guess assumes that's the entire file?
      • amluto 7 hours ago
        The client has been SIGKILLed, so it’s not assuming anything. I wonder whether the comment is a typo and they meant to kill -9 the server instead.
        • jmalicki 5 hours ago
          Hypothetically, if this was HTTP without a Content-Length (like it used to be in the olden days), you could have a proxy server assume this is the entire file.
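
          A tiny sketch of why the header matters to whoever receives the body. The helper name `body_status` and its return strings are hypothetical, not taken from any HTTP library.

```python
from typing import Optional


def body_status(content_length: Optional[int], received_len: int) -> str:
    """Classify a body that ended because the connection hit end-of-file.

    content_length is the declared Content-Length, or None if the response
    was delimited only by the connection closing.
    """
    if content_length is None:
        # With no declared length, EOF is the only end marker, so a
        # truncated body looks exactly like a complete one.
        return "unknown"
    if received_len < content_length:
        return "truncated"
    return "complete"
```

          With a declared length the receiver can tell a short body from a full one; without it, a clean FIN after half the file is indistinguishable from success.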
        • MrBuddyCasino 6 hours ago
          Perhaps a permanently hung connection because timeout is zero (=disabled?)?
          • muststopmyths 5 hours ago
            Seems plausible, since FIN only means “I’m done sending”, also called a “half close”.

            FTP has different data and command connections so the server may not have an outstanding read to detect the data connection break.

            But it should still clean up both when the command connection dies.

  • toast0 8 hours ago
    Might want to read the section on Lingering Close from here:

    https://httpd.apache.org/docs/2.4/misc/perf-tuning.html

    • zokier 7 hours ago
      That seems very outdated? Doesn't `shutdown` resolve the problem here?
      • toast0 6 hours ago
        Shutdown only helps if it's used; but TFA didn't mention it. So they're going to have to relearn the lessons of the 90s?

        Also, I think the state of the art hasn't really changed? If you don't want a reset, you need to read everything from the socket before you close. If you don't really care about a reset as long as it doesn't interrupt the reader, you can shutdown in your direction, and drop the socket off to something that will wait "long enough" before it closes. In an eventloop architecture, you can just put it in as a deferred task; in process per connection, you should probably send the socket to a dedicated lingering closer process that doesn't interrupt your flow.
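
        The event-loop variant could look roughly like this. It is a sketch, assuming an asyncio loop and a plain blocking socket; the helper name `shutdown_then_close_later` and the 2-second default delay are made up, and "long enough" is left to the caller.

```python
import asyncio
import socket


def shutdown_then_close_later(
    loop: asyncio.AbstractEventLoop,
    sock: socket.socket,
    delay: float = 2.0,
) -> asyncio.TimerHandle:
    """Half-close now, fully close after `delay` seconds.

    shutdown(SHUT_WR) sends the FIN right away, so the peer can read the
    whole reply followed by end-of-file. The close() that might still
    produce a RST (if unread data is pending) is deferred as a timer task.
    """
    sock.shutdown(socket.SHUT_WR)
    return loop.call_later(delay, sock.close)
```

        The returned handle can be cancelled if the connection gets cleaned up some other way first. This does not prevent the reset; it only gives the reader time to finish before it can happen.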

        • Dylan16807 23 minutes ago
          > Shutdown only helps if it's used; but TFA didn't mention it.

          There's a part 2. It's only linked at the top for some reason, not at the bottom.

          Part 2 says they tried shutdown and it didn't change anything.

  • kune 5 hours ago
    The RST (Reset) is sent to inform the client that the data it sent was not read by the server. Here the RST avoids the four-way handshake for closing the TCP connection, and the long wait times if the client doesn't behave normally.

    For the case here the server should call shutdown with SHUT_WR after sending the data and then drain the incoming data before closing the socket.
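
    That sequence might be sketched as follows. The function name `drain_and_close`, the timeout, and the byte cap are assumptions for illustration; a real server would pick limits that suit its traffic.

```python
import socket
import time


def drain_and_close(sock: socket.socket, timeout: float = 1.0,
                    max_bytes: int = 1 << 20) -> int:
    """Half-close, discard incoming data, then close.

    Stops draining at end-of-file, after `timeout` seconds, or after
    `max_bytes` bytes, whichever comes first. Returns the number of
    bytes discarded.
    """
    sock.shutdown(socket.SHUT_WR)  # FIN goes out after the data already sent
    deadline = time.monotonic() + timeout
    drained = 0
    try:
        while drained < max_bytes:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            sock.settimeout(remaining)
            try:
                chunk = sock.recv(65536)
            except (socket.timeout, ConnectionError):
                break  # peer went quiet or reset the connection
            if not chunk:
                break  # peer closed its side too
            drained += len(chunk)
    finally:
        sock.close()
    return drained
```

    The timeout and cap are there so that a client that keeps sending cannot hold the server in the drain loop indefinitely.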

  • bayesnet 2 hours ago
    Really love this article. Opens with the problem statement and jumps straight into the investigation. Thanks for a very enjoyable read (and an rss feed!)
  • gunsch 6 hours ago
    A few months ago I was debugging a similar issue in a Go-based service layer, where frequent HTTP requests to the same domain kept making fresh TCP connections when I was expecting TCP conn reuse.

    In this situation we were discarding the HTTP response without reading it before closing, which kept Go from reusing the connection. I didn't dig quite as deep as this post's author, but I imagine the same RST behavior was happening under the hood.

  • Joker_vD 8 hours ago
    > Send off the data and close the socket. If there's data still pending to be read, this will cause a RST, I think.

    Um, yes? That's how TCP has been universally implemented for more than 30 years. See [0], 2.17 for discussion.

    [0] https://www.rfc-editor.org/rfc/rfc2525#page-50

    • eggnet 7 hours ago
      That’s for rx, not tx. When you close a socket with data in the send buffer, that does not trigger a RST, as long as you just close the socket normally.
      • pdonis 7 hours ago
        > When you close a socket with data in the send buffer

        That's not what's happening here. The server is closing the socket when there's data from the client that it hasn't read.

        • Joker_vD 7 hours ago
          Yep, and that makes adding "Connection: close" to an HTTP reply somewhat tricky on the HTTP/1.1 server side: ideally you need to read all of the pipelined requests from the client before closing the connection, which is usually something you'd rather not do. But if you just close it, you risk your client getting a partial reply, so you'd better add "Content-Length" or "Transfer-Encoding: chunked" to your reply as well. One common reason to send a connection-close reply, though, is that you don't know the content length beforehand, so I hope you implemented chunking correctly :)
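
          To make the chunking point concrete, here is a simplified sketch of the chunked framing and of how a receiver can notice a cut-off reply. The names `encode_chunked` and `decode_chunked` are made up, and the decoder assumes no trailer fields; it is not a complete HTTP/1.1 implementation.

```python
from typing import Iterable, Tuple


def encode_chunked(parts: Iterable[bytes]) -> bytes:
    """Frame body parts with Transfer-Encoding: chunked."""
    out = bytearray()
    for part in parts:
        if part:  # a zero-length chunk would terminate the body early
            out += f"{len(part):x}\r\n".encode("ascii") + part + b"\r\n"
    out += b"0\r\n\r\n"  # last chunk plus the empty line that ends the body
    return bytes(out)


def decode_chunked(body: bytes) -> Tuple[bytes, bool]:
    """Return (payload, complete); complete is False for a truncated or
    malformed body. Chunk extensions are skipped, trailers are not handled."""
    payload = bytearray()
    pos = 0
    while True:
        eol = body.find(b"\r\n", pos)
        if eol < 0:
            return bytes(payload), False  # size line never arrived
        size_field = body[pos:eol].split(b";", 1)[0].strip()
        try:
            size = int(size_field, 16)
        except ValueError:
            return bytes(payload), False
        if size < 0:
            return bytes(payload), False
        start = eol + 2
        if size == 0:
            # Complete only if the final empty line is present.
            return bytes(payload), body[start:start + 2] == b"\r\n"
        end = start + size
        payload += body[start:end]  # may be a partial chunk
        if body[end:end + 2] != b"\r\n":
            return bytes(payload), False
        pos = end + 2
```

          A reply that was cut short never reaches the terminating zero-length chunk, so `decode_chunked` reports `complete` as False instead of silently accepting a partial body.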
          • jmalicki 4 hours ago
            Even if you just close it, and it RSTs, and you implement chunked etc. properly, the client can't even necessarily read all of the information already sent to the client.

            If the client does a read() after the RST gets there, any data in the receive window is gone, any packets that might have been dropped and need to be resent are gone, etc.

          • toast0 5 hours ago
            More explicit connection-closing indications are one of the nice things about HTTP/2. Of course, it's bundled with the silly multiplexing :(
  • jcalvinowens 5 hours ago
    As others have noted, this usually happens because both sides wrote data and one side didn't read it before calling close().

    Here's a little reproducer: https://gist.github.com/jcalvinowens/da57edda9a01ca9f4c4088a...

        $ gcc -O2 test.c -o test
        
        $ strace -e socket,connect,write,accept,read,close ./test --rx
        <...>
        socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 3
        accept(3, NULL, NULL)                   = 4
        close(3)                                = 0
        read(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
        <...>
        read(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
        read(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 3035
        read(4, "", 4096)                       = 0
        close(4)                                = 0
        +++ exited with 0 +++
    
        $ strace -e socket,connect,write,accept,read,close ./test --tx
        <...>
        socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 3
        connect(3, {sa_family=AF_INET, sin_port=htons(31337), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
        write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 600000) = 600000
        close(3)                                = 0
        +++ exited with 0 +++
    
    ...versus:

        $ gcc -O2 -DWRITE_TO_SOCKET_BEFORE_READ test.c -o test
        
        $ strace -e socket,connect,write,accept,read,close ./test --rx
        <...>
        socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 3
        accept(3, NULL, NULL)                   = 4
        close(3)                                = 0
        write(4, "\250\3\0\0\0\0\0\0\250\3\0\0\0\0\0\0$\0\0\0\0\0\0\0$\0\0\0\0\0\0\0"..., 4096) = 4096
        read(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
        <...>
        read(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 997
        read(4, 0x7ffd45c2d3c0, 4096)           = -1 ECONNRESET (Connection reset by peer)
        <...>
        +++ exited with 1 +++
        
        $ strace -e socket,connect,write,accept,read,close ./test --tx
        <...>
        socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 3
        connect(3, {sa_family=AF_INET, sin_port=htons(31337), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
        write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 600000) = 600000
        close(3) 
        +++ exited with 0 +++