Leveraging Global Accelerator for a self-managed VPN in AWS

First off, I have to address the fact that it’s been a long long time since I’ve posted anything here. Second and related statement:

Any views or information expressed here do not reflect those of AWS. This site is a personal project and should be treated as such. It’s just a lab space where I throw info and play with projects/ideas.

With that out of the way, it’s also the answer to why I haven’t posted much here in a long time. I’ve moved across the country to Portland, OR, and I’m very happily working for a cloud provider we all know. This all happened in the middle of Covid, so needless to say, between moving and starting the new job I’ve been pretty busy. Admittedly there’s also been a lot of surfing, but there are no waves today so I’m doing my next favorite thing: nerding out!

Covid is actually one of the reasons I’m making this post. ThousandEyes released their annual internet performance report, which focuses heavily on the shift to working from home and how it has influenced network behavior across the web.

https://www.thousandeyes.com/resources/internet-performance-report-covid-19-impact

While cloud providers have maintained solid reliability, ISPs have demonstrated glaring holes in capacity and reliability. Allow me to pause on that statement.

“Cloud providers have maintained solid reliability”

Global Accelerator

This is the purpose of discussing Global Accelerator. GA is an AWS service that allows an Admin/Engineer to allocate AnyCast IPs to resources in their AWS environment. The goal is to give clients a closer point at which to on-board to the AWS backbone, so that their traffic avoids internet congestion and outages. While ISPs may be having issues amongst themselves, if we’re able to get client traffic onto a reliable network and skip the interruptions that traffic would usually encounter, we can deliver a more reliable and consistent experience. While AWS may not have a data center close to every client, it may have a POP or Edge Location. To see a list of GA POP/Edge locations or for more information on the service, visit the following.

https://aws.amazon.com/global-accelerator/faqs/

Global Accelerator and a self-hosted VPN

AWS Site-To-Site VPN can leverage GA so long as Transit Gateway is in use. This is great, but I wanted to see if I could implement GA on a self-managed VPN solution. For example, some companies may forgo the built-in VPN options that a cloud provider offers. This is what I did in my previous role, implementing a vAppliance SD-WAN solution because of the ease of configuration and fail-over, along with added features for routing and traffic control. So can we combine something like this with GA? The short answer is yes: we can associate GA with an EIP or Instance-ID in AWS, allowing us to use a GA AnyCast IP address as the VPN Peer IP address. However, this depends on 2 key points in order to work.

Global Accelerator does NOT work for outbound connections. Because of this, the On-Prem or Remote VPN Peer MUST be the initiator. If the resource in AWS behind GA attempts to initiate a VPN connection, the traffic will come from its EIP and NOT the GA AnyCast IP address. Phase1/2 Parent/Child SA re-keys are then performed over the existing UDP/4500 session, so it does not matter whether the Initiator or the Responder performs the re-key.
Global Accelerator does NOT allow IP protocol 50 (ESP) and can only be configured with protocols 6 and 17 (TCP and UDP). Because of this, NAT-T MUST be used in order to encapsulate the ESP traffic in UDP/4500.
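
To make the second point a bit more concrete, here is a minimal Python sketch of what NAT-T puts on UDP/4500, following the framing described in RFC 3948: IKE messages are prefixed with four zero bytes (the non-ESP marker), ESP packets start directly with their non-zero SPI, and a single 0xFF byte is a NAT keepalive. The function names and the sample SPI are my own illustrative assumptions and are not part of any AWS API or VPN product.

```python
import struct

NON_ESP_MARKER = b"\x00\x00\x00\x00"  # RFC 3948: precedes IKE messages on UDP/4500
NAT_KEEPALIVE = b"\xff"               # RFC 3948: one-byte keepalive payload


def ga_can_carry(ip_protocol: int) -> bool:
    """Hypothetical helper: GA listeners only accept TCP (6) and UDP (17)."""
    return ip_protocol in (6, 17)


def classify_udp4500_payload(payload: bytes) -> str:
    """Tell apart the three kinds of traffic that share UDP/4500 under NAT-T."""
    if payload == NAT_KEEPALIVE:
        return "nat-keepalive"
    if len(payload) < 4:
        return "invalid"
    if payload[:4] == NON_ESP_MARKER:
        return "ike"
    return "esp"  # first four bytes are a non-zero SPI


# An ESP header is SPI + sequence number; the SPI value here is made up.
sample_esp = struct.pack("!II", 0xC0FFEE01, 1) + b"encrypted-payload"

print(ga_can_carry(50))                       # native ESP (IP protocol 50)
print(ga_can_carry(17))                       # UDP, which is what NAT-T uses
print(classify_udp4500_payload(sample_esp))
```

Since both IKE and ESP end up inside the same UDP/4500 flow, a UDP listener on GA is enough to carry the whole tunnel once NAT-T is forced on.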

If those considerations are taken into account, I’m happy to report that we can leverage GA for our IPSec Tunnels on our vAppliances in the cloud.

What to expect

GA is not necessarily going to make a difference to some clients, or perhaps any clients for that matter. This largely depends on their location and the network conditions typically encountered over their public route to a server/resource. In addition, GA still can’t make data move faster than the speed of light. We’re always going to see latency when traversing the globe. Until we master Quantum Entanglement, I think we’re stuck with this limitation. 😛

I’m lucky that my network provider here in Portland (CenturyLink) actually provides a great Gigabit connection and has proven reliable. Regardless, I’ll share what differences I experienced when using GA. I created a vAppliance in us-east-1 and tested latency and throughput from here in Oregon.

Here’s the public internet route to the public IP associated with the vAppliance. It’s notable that even though I’m not using GA here, I’m on-boarded to the AWS network by Hop 6. This is likely due to the large AWS presence here (us-west-2 is in Oregon), and this behavior will differ from location to location. Also, take the Loss% column with a grain of salt here. MTR is being pushed to its limits with a route this long.

■■■■■■■■■■■■:~# mtr -P 80 -T 54.159.206.44 --max-ttl 100 --max-unknown 100 -c 50 --report
Start:
HOST: ■■■■■■■                     Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 172.17.0.1                 0.0%    50    0.6   0.4   0.3   0.6   0.1
  2.|-- ptld-dsl-gw52.ptld.qwest.  0.0%    50    2.4   2.5   1.2  11.4   1.6
  3.|-- ptld-agw1.inet.qwest.net   0.0%    50    2.6   3.7   2.1  22.5   3.3
  4.|-- cer-edge-19.inet.qwest.ne  0.0%    50   53.0  54.8  52.2 124.4  10.1
  5.|-- 65.113.250.30              0.0%    50   52.9  53.2  52.0  56.6   0.7
  6.|-- 52.93.249.27               0.0%    50   53.7  54.7  53.5  65.2   2.2
  7.|-- 52.95.62.95                0.0%    50   53.6  54.2  52.9  63.1   1.8
  8.|-- ???                       100.0    50    0.0   0.0   0.0   0.0   0.0
  9.|-- 52.93.129.136              0.0%    50   75.5  76.1  74.5  86.1   2.3
 10.|-- 150.222.242.150           28.0%    50   75.4  75.5  73.9  81.0   1.5
 11.|-- ???                       100.0    50    0.0   0.0   0.0   0.0   0.0
 12.|-- 150.222.242.154           94.0%    50   79.1  76.4  75.1  79.1   2.3
 13.|-- ???                       100.0    50    0.0   0.0   0.0   0.0   0.0
 14.|-- ???                       100.0    50    0.0   0.0   0.0   0.0   0.0
 15.|-- 150.222.243.197            2.0%    50   75.4  76.2  74.2  90.5   3.2
 16.|-- ???                       100.0    50    0.0   0.0   0.0   0.0   0.0
 17.|-- 150.222.241.187           94.0%    50   75.4  75.4  75.2  75.6   0.2
 18.|-- ???                       100.0    50    0.0   0.0   0.0   0.0   0.0
 19.|-- ???                       100.0    50    0.0   0.0   0.0   0.0   0.0
 20.|-- ???                       100.0    50    0.0   0.0   0.0   0.0   0.0
 21.|-- ???                       100.0    50    0.0   0.0   0.0   0.0   0.0
 22.|-- ???                       100.0    50    0.0   0.0   0.0   0.0   0.0
 23.|-- ???                       100.0    50    0.0   0.0   0.0   0.0   0.0
 24.|-- ???                       100.0    50    0.0   0.0   0.0   0.0   0.0
 25.|-- ???                       100.0    50    0.0   0.0   0.0   0.0   0.0
 26.|-- 52.93.28.232              12.0%    50   74.8  75.7  74.2  83.0   1.9
 27.|-- ???                       100.0    50    0.0   0.0   0.0   0.0   0.0
 28.|-- 52.93.28.234              98.0%    50   75.1  75.1  75.1  75.1   0.0
 29.|-- ???                       100.0    50    0.0   0.0   0.0   0.0   0.0
 30.|-- ???                       100.0    50    0.0   0.0   0.0   0.0   0.0
 31.|-- ???                       100.0    50    0.0   0.0   0.0   0.0   0.0
 32.|-- ec2-54-159-206-44.compute 46.0%    50   74.1  75.0  74.1  75.4   0.3

Here’s the internal MTR seen over the VPN tunnel from my home to us-east-1

■■■■■■■■■■■■:~# mtr -P 80 -T 172.31.94.202 -c 10 --report
Start:
HOST: ■■■■■■■                     Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 172.17.0.1                 0.0%    10    0.5   0.4   0.3   0.6   0.1
  2.|-- 169.254.254.2              0.0%    10   75.6  75.2  74.3  75.6   0.2
  3.|-- 172.31.94.202              0.0%    10   75.8  75.6  75.1  77.1   0.3

And lastly, here’s the iperf performance using a single TCP Stream

■■■■■■■■■■■■:~# iperf3 -c 172.31.94.202 -P 1
Connecting to host 172.31.94.202, port 5201
[  5] local 172.17.0.25 port 55826 connected to 172.31.94.202 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  10.7 MBytes  89.6 Mbits/sec    0   2.46 MBytes       
[  5]   1.00-2.00   sec  18.8 MBytes   157 Mbits/sec    0   2.46 MBytes       
[  5]   2.00-3.00   sec  18.8 MBytes   157 Mbits/sec    0   2.46 MBytes       
[  5]   3.00-4.00   sec  18.8 MBytes   157 Mbits/sec    1   2.46 MBytes       
[  5]   4.00-5.00   sec  18.8 MBytes   157 Mbits/sec    0   2.46 MBytes       
[  5]   5.00-6.00   sec  17.5 MBytes   147 Mbits/sec    0   2.46 MBytes       
[  5]   6.00-7.00   sec  18.8 MBytes   157 Mbits/sec    0   2.46 MBytes       
[  5]   7.00-8.00   sec  20.0 MBytes   168 Mbits/sec    0   2.46 MBytes       
[  5]   8.00-9.00   sec  18.8 MBytes   157 Mbits/sec    0   2.46 MBytes       
[  5]   9.00-10.00  sec  18.8 MBytes   157 Mbits/sec    0   2.46 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   179 MBytes   151 Mbits/sec    1             sender
[  5]   0.00-10.00  sec   177 MBytes   149 Mbits/sec                  receiver

I want to pause on throughput. I stayed with a single TCP stream here because it better reflects the Bandwidth Delay Product, or BWDP. Latency is a big factor in application throughput because of TCP window sizing. Depending on the congestion avoidance and TCP window scaling algorithms in use, high latency can butcher what is otherwise a high-throughput pipe. This really comes into play when sub-optimal protocols such as SMB are sent over high latency links. A quick Google search will show how many Admins/Engineers have been bitten by SMB over high latency links. Database protocols such as ODBC connections fare even worse. Any latency we can reduce will often have a positive impact on application performance.
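
As a rough illustration of that relationship, here is a small Python sketch of the two textbook formulas: the BWDP itself (bandwidth × round-trip time) and the ceiling a fixed window puts on a single TCP stream (window ÷ round-trip time). The 75.6 ms figure is the average RTT from the internal MTR above. The gigabit link speed, the 64 KiB window (a sender without window scaling) and the 10 ms comparison RTT are assumptions chosen purely for illustration.

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bytes that must be in flight to keep a link of this speed full."""
    return bandwidth_bps * rtt_s / 8


def max_throughput_bps(window_bytes: float, rtt_s: float) -> float:
    """Upper bound for one TCP stream: at most one window per round trip."""
    return window_bytes * 8 / rtt_s


RTT_MEASURED_S = 0.0756   # avg RTT over the tunnel, from the MTR report above
RTT_NEARBY_S = 0.010      # assumption: a hypothetical low-latency path
LINK_BPS = 1e9            # assumption: gigabit access link
WINDOW_BYTES = 65_535     # assumption: 64 KiB window, no window scaling

print(f"BWDP at 75.6 ms: {bdp_bytes(LINK_BPS, RTT_MEASURED_S) / 1e6:.2f} MB in flight")
for rtt in (RTT_MEASURED_S, RTT_NEARBY_S):
    ceiling = max_throughput_bps(WINDOW_BYTES, rtt) / 1e6
    print(f"64 KiB window at {rtt * 1000:.1f} ms: at most {ceiling:.2f} Mbit/s")
```

The same window that is plenty on a nearby path becomes the bottleneck on a cross-country one, which is why shaving even a few milliseconds can matter for a single stream.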

For more information on BWDP, I highly recommend the following article (book) over on O’Reilly

https://hpbn.co/building-blocks-of-tcp/#bandwidth-delay-product

Now with GA

Disregard the last hop latency here, as it is misleading. This is not the latency to the vAppliance, but rather to the AWS POP.

■■■■■■■■■■■■:~# mtr -P 80 -T 75.2.42.249 -c 50 --report
Start:
HOST: ■■■■■■■                     Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 172.17.0.1                 0.0%    50    0.3   0.3   0.3   0.5   0.1
  2.|-- ptld-dsl-gw52.ptld.qwest.  0.0%    50    2.0   3.5   1.3  40.8   6.7
  3.|-- ptld-agw1.inet.qwest.net   0.0%    50    2.2   2.8   1.2  11.3   1.8
  4.|-- tuk-edge-14.inet.qwest.ne  0.0%    50   26.9   7.7   4.9  64.4   8.9
  5.|-- 65-122-235-178.dia.static  0.0%    50    6.0   7.2   5.1  19.6   3.3
  6.|-- 52.95.54.182               0.0%    50    6.8   7.6   5.8  16.1   2.1
  7.|-- a33ff907a505eb902.awsglob  0.0%    50    5.5   5.8   5.1   7.5   0.3

Instead, consider the internal MTR over the VPN for an accurate representation of latency

■■■■■■■■■■■■:~# mtr -P 80 -T 172.31.94.202 -c 10 --report
Start:
HOST: ■■■■■■■                     Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 172.17.0.1                 0.0%    10    0.3   0.3   0.3   0.4   0.0
  2.|-- 169.254.254.2              0.0%    10   69.2  69.4  69.2  69.9   0.2
  3.|-- 172.31.94.202              0.0%    10   70.0  69.9  69.7  70.2   0.2

And the iperf results

■■■■■■■■■■■■:~# iperf3 -c 172.31.94.202 -P 1
Connecting to host 172.31.94.202, port 5201
[  5] local 172.17.0.25 port 55860 connected to 172.31.94.202 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  11.5 MBytes  96.2 Mbits/sec    0   3.02 MBytes       
[  5]   1.00-2.00   sec  20.0 MBytes   168 Mbits/sec    0   3.02 MBytes       
[  5]   2.00-3.00   sec  20.0 MBytes   168 Mbits/sec    0   3.02 MBytes       
[  5]   3.00-4.00   sec  20.0 MBytes   168 Mbits/sec    0   3.02 MBytes       
[  5]   4.00-5.00   sec  21.2 MBytes   178 Mbits/sec    0   3.02 MBytes       
[  5]   5.00-6.00   sec  20.0 MBytes   168 Mbits/sec    0   3.02 MBytes       
[  5]   6.00-7.00   sec  20.0 MBytes   168 Mbits/sec    0   3.02 MBytes       
[  5]   7.00-8.00   sec  20.0 MBytes   168 Mbits/sec    0   3.02 MBytes       
[  5]   8.00-9.00   sec  20.0 MBytes   168 Mbits/sec    0   3.02 MBytes       
[  5]   9.00-10.00  sec  20.0 MBytes   168 Mbits/sec    0   3.02 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   193 MBytes   162 Mbits/sec    0             sender
[  5]   0.00-10.00  sec   191 MBytes   160 Mbits/sec                  receiver

Conclusion

While my use case here in Portland did not demonstrate spectacular results, there was a clear improvement. I can see that I’m already using AWS’s network by my 6th hop here in Oregon (whois 65.113.250.30). Even with that, I’m seeing a 5ms improvement in latency, which translates into a larger congestion window and an extra 10 Mbits/sec per TCP stream! This was also on a Saturday morning, so we might expect different (worse) results over the public internet during peak evening or business hours. YMMV, as results aren’t dependent on how great AWS’s network is, but rather on how bad your ISP’s network is!
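
For anyone who wants to run the same comparison on their own numbers, here is a minimal Python sketch of the arithmetic. The inputs are the last-hop average RTTs from the two internal MTR reports and the sender bitrates from the two iperf3 summaries above. The helper name is my own, and this is a single Saturday-morning sample rather than a benchmark.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


# Values copied from the reports above.
rtt_public_ms, rtt_ga_ms = 75.6, 69.9        # internal MTR, last hop, Avg column
tput_public_mbps, tput_ga_mbps = 151.0, 162.0  # iperf3 sender summary lines

print(f"Latency:    {rtt_public_ms - rtt_ga_ms:.1f} ms lower "
      f"({percent_change(rtt_public_ms, rtt_ga_ms):.1f}%)")
print(f"Throughput: {tput_ga_mbps - tput_public_mbps:.0f} Mbit/s higher "
      f"({percent_change(tput_public_mbps, tput_ga_mbps):+.1f}%)")
```

Both changes land in the same single-digit percentage range, which is roughly what you would expect when one stream’s throughput is bounded by round-trip time.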