<div dir="ltr">Okay, I think we're done with the heavy testing. There are still a few loose ends I'd like to wrap up, but I think Sander and I agree that it's time to put this thread's concerns to rest. The results may be disappointing in a way, but at least we didn't come out empty-handed.<br><br>Summary first:<br><br>Jool, even 3.5.4 Jool, withstands T-Rex's torture traffic without flinching. There are no significant performance issues to worry about. CPU usage is at 1% at worst and there are no packet drops. The bottleneck was some hardware quirk or some software misconfiguration that, unfortunately, we never spotted.<br><br>(Come to think of it, T-Rex uses more CPU than Jool, at least when BIB logging is turned off. I didn't expect that, considering T-Rex uses DPDK.)<br><br>I'm going to detail my thoughts now. If all you wanted was the outcome of the performance testing voyage, that was it, so you don't need to read this mail any further.<br><br>So, basically, I have personally never seen any trace of Jool struggling while being pummeled by T-Rex's traffic. I can't explain Sander's soft lockup at all, but for as long as I have run the test, the machine has remained responsive and `top` has always reported minimal CPU utilization. (As long as I turn BIB logging off; rsyslog does seem to struggle due to the logging volume.) From its perspective, Jool is translating packets without even noticing disruptions. Packets are being translated uneventfully and the kernel returns success whenever Jool entrusts a packet to it. (Although admittedly, given the random and imperfect nature of networks, that does not guarantee that the packet was actually delivered.) 
Whatever was causing mayhem, Jool wasn't even aware of it.<br><br>So I'd like to clarify:<br><br>> it also seems that session creation has a higher priority than packet forwarding<br><br>I will not deny this, but I should mention that this is not so much a priority issue as a natural consequence of the fact that a packet is sent only *after* it has been translated. Once Jool has done its job, it is up to the kernel whether the packet is posted on the interface, and even then, it is up to the network to deliver the packet to the destination.<br><br>We used to get very strange performance reports (speed going up and down with massive drop rates) until, at some point in the past month, the issue appeared to fix itself, seemingly independently of anything I did. And that's why I'm down on this whole ordeal: We can't go back to the old setup to continue testing. There is no way to find the root of the problem now. I had hoped this whole effort would result in a documentation note akin to the offload warning, so users could be aware of it, but there is nothing I can do anymore. All I can hope is that this was a hardware bug which silently prevented offloads from being turned off or something like that, and which eventually fixed itself as a result of my constant random tweaking everywhere.<br><br>But at least I can now say fairly confidently that Jool is pretty darn fast, even without the latest performance tweaks applied: now that whatever was hobbling it before is gone, it clearly keeps up with at least this configuration of T-Rex with flying colors.<br><br>Well, that is what I personally am getting out of the output, anyway. In case anyone wants to review T-Rex's latest report, I have attached it to this mail along with one of the old/failing ones. 
Here's what I believe is the official documentation of this output: <a href="https://trex-tgn.cisco.com/trex/doc/trex_manual.html#_trex_output">https://trex-tgn.cisco.com/trex/doc/trex_manual.html#_trex_output</a>. The numbers I'm basing my claims on are the drop-rate (outright zero), the "Total-Tx" and "Total-Rx" numbers (always close to each other) and the active flows ("m_active_flows" and "m_est_flows"; also always close to each other). One thing that has baffled me this whole time is the Transmission Bandwidth ("Tx Bw"). I don't understand why it would be so different for each port, but then I don't really understand where T-Rex gets this information in the first place.<br><br>When I said that we didn't come out empty-handed, I was referring to a little bottleneck I spotted while looking at Jool's numbers. It turns out that the session timer can hold a spinlock for an irresponsibly long time (I probably underestimated those tree removal rebalances), which keeps any packets being translated in the meantime waiting. It is not overly significant, because it affects the maximum translation timestamp far more than the average one, but for as much of a micro-optimization as it is, addressing it would certainly be more productive than dealing with the cache misses that kernel developers usually seem to obsess over. It's not going to make Jool noticeably faster either, which is why it should be safe to postpone it to the release after the upcoming one.<br><br>Regarding the BIB-pool4 optimization I explained a few mails back: This wasn't the problem at all, because T-Rex's traffic was not enough to exhaust pool4. I believe I will still end up merging the solution into master regardless.<br><br>Finally, I need to announce that I'm going on a week's vacation starting tomorrow, so I won't be able to merge all the latest work until I'm back. In the meantime, since there are no severe performance issues to note, using the latest version of Jool shouldn't be too hazardous. 
I do recommend master over 3.5.4 due to the latest bugfix, though.<br><br>I'm still keeping an eye on the list.<br></div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Sep 15, 2017 at 2:03 PM, Alberto Leiva <span dir="ltr"><<a href="mailto:ydahhrk@gmail.com" target="_blank">ydahhrk@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><div>It's strange that the third stream pushed the SIIT further when the NAT64 was already at 100% in the second stream. I'd have expected the load to be the same, since the NAT64 could not, in theory, send the SIIT any more packets. Or am I misunderstanding something?<br><br></div>Also, I'm really stumped that you managed to peak using only two streams. This should render all database lookups practically instantaneous. If this is the case then... yeah. I guess there's not much I will be able to do from code.<br><br></div><div>But we'll see.<br></div></div><div class="HOEnZb"><div class="h5"><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Sep 15, 2017 at 12:18 PM, Alan Whinery <span dir="ltr"><<a href="mailto:whinery@hawaii.edu" target="_blank">whinery@hawaii.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

  <div text="#000000" bgcolor="#FFFFFF">

    <br>

    Sort of a digression, but since Alberto referred to Linux router

    performance -- <br>

    <p>After I got the Jool/Jool v6-only NAT64/bitw-464CLAT scenario

      working, I tried some file transfers at 100 Mbps to v4-numeric

      addresses, so it was hitting both boxes (VMs, actually).  Watching

      the software interrupt load with top, I was getting around 10%

      load on the first 100 Mbps stream and a second stream pushed NAT64

      to 100% load on SI, while CLAT was only doing about  30%, third

      stream 100% NAT64, 90% CLAT. <br>

    </p>

    <p>Attached PDF is what I wrote when I still remembered, about

      increasing cores and spreading CPU affinity to mitigate. <br>

    </p>

    <p>The point being that there are things to be understood about

      Linux router performance, in tandem with NAT64/SIIT performance.

      For one, rolling in the right off-loading, coalescence, etc, as

      well as CPU affinity to tune the box like a router, rather than as

      a host. This stuff might be a problem well before you get to

      the network scale that has been tested with TRex.<br>

    </p><div><div class="m_7126646491090073336h5">

    <br>

    <div class="m_7126646491090073336m_1196453893782140520moz-cite-prefix">On 9/15/2017 5:49 AM, Alberto Leiva

      wrote:<br>

    </div>

    </div></div><blockquote type="cite"><div><div class="m_7126646491090073336h5">

      <div dir="ltr">

        <div>

          <div>

            <div>Thank you!<br>

            </div>

            <div><br>

              > One thing I have been wondering about is if the TRex

              side gets confused and Jool is actually ok. If that is the

              case then I apologise!<br>

              <br>

            </div>

            <div>Well, who knows. I'm thinking that, if a normal Linux

              router would pass a similar test but a NAT64 Linux with

              Jool doesn't, then there should in theory be something

              that can be done.<br>

            </div>

            <div><br>

              > What would be the best way to check that? Massive

              pcaps?<br>

              <br>

            </div>

            I will compile a version with a bunch of timestamp tracking

            and see if we can get some conclusions out of it.<br>

            <br>

          </div>

        </div>

        Working...<br>

      </div>

      <div class="gmail_extra"><br>

        <div class="gmail_quote">On Fri, Sep 15, 2017 at 5:24 AM, Sander

          Steffann <span dir="ltr"><<a href="mailto:sander@steffann.nl" target="_blank">sander@steffann.nl</a>></span>

          wrote:<br>

          <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi,<br>

            <span><br>

              > Okay, guys. Prototype ready. I didn't test a

              gazillion connections, but as far as basic functionality

              goes, it looks stable. Don't quote me on that, though.<br>

              ><br>

              > Experimental branch in fake-nat64, in case anyone

              wants to try it out: <a href="https://github.com/NICMx/Jool/tree/fake-nat64" rel="noreferrer" target="_blank">https://github.com/NICMx/Jool/<wbr>tree/fake-nat64</a><br>

              <br>

            </span>Sorry, it still collapses :(<br>

            <br>

            I recorded a small test here: <a href="http://www.steffann.nl/sander/Fake%20NAT64%20collapse.mov" rel="noreferrer" target="_blank">http://www.steffann.nl/sander/<wbr>Fake%20NAT64%20collapse.mov</a><br>

            <br>

            The behaviour is really strange. One thing I have been

            wondering about is if the TRex side gets confused and Jool

            is actually ok. If that is the case then I apologise! What

            would be the best way to check that? Massive pcaps?<br>

            <br>

            Cheers,<br>

            Sander<br>

            <br>

          </blockquote>

        </div>

        <br>

      </div>

      <br>

      <fieldset class="m_7126646491090073336m_1196453893782140520mimeAttachmentHeader"></fieldset>

      <br>

      </div></div><span><pre>______________________________<wbr>_________________

Jool-list mailing list

<a class="m_7126646491090073336m_1196453893782140520moz-txt-link-abbreviated" href="mailto:Jool-list@nic.mx" target="_blank">Jool-list@nic.mx</a>

<a class="m_7126646491090073336m_1196453893782140520moz-txt-link-freetext" href="https://mail-lists.nic.mx/listas/listinfo/jool-list" target="_blank">https://mail-lists.nic.mx/list<wbr>as/listinfo/jool-list</a>

</pre>

    </span></blockquote>

    <br>

  </div>


<br></blockquote></div><br></div>

</div></div></blockquote></div><br></div>