How I sent 500 million HTTP requests to 2.5 million hosts
@kannthu1|July 4, 2024 (4m ago)19,542 views
The answer is Go
.
My usecase was ethical hacking, I needed to send 500 million of non RFC compliant HTTP/1.1
requests to 2.5 million hosts in a short period of time - ideally in a couple of hours.
I went deep into the rabbit hole of HTTP/1.1
and Go
to optimize the process. I scaled it horizontaly with kubernetes, and optimized the code to use each CPU core as effective as possible. I even edited Go's HTTP library to speed up the process.
I chose Go
because it is simple, has great concurrency support, and is fast. + My smol Grug brain can understand it.
Yes, I tried implementing it in Rust, but unfortunately, my brain is too small for async tokio types magic. Go, on the other hand, allowed the JS developer to write this whole thing, this is quite a statement about the language.
How much is 500 million HTTP/1.1
requests?
You are probably wondering if it is a lot or not. It is.
If you used curl
to send all of those requests one by one from single machine it would take you 7.9 years (assuming 0.5s per request).
In real world it would me much slower, servers would rate limit you, and respond even slower than 0.5s
From data transfer perspective it is not that much:
- 500 million requests * 1 KB (average request size) ≈ 478 GB
- 500 million responses * 5 KB (average response size) ≈ 2.33 TB
The problem lies in different place...
What does it mean to send a single HTTP/1.1
request?
From your perspective it is just a single call
1
2resp, err := http.Get("https://example.com")
3
But under the hood? The HTTP lib is doing a lot of things for you.
- HTTP client resolves DNS
- HTTP client connects to another machine using TCP
- HTTP client does TLS handshake (generate and exchange cryptographic keys)
- HTTP client prepares HTTP request to send (normalize, encode)
- HTTP client sends HTTP request (headers and body)
- HTTP client waits for response and reads it
- HTTP client parses response (decode, normalize, and parse the response)
- Close the connection (optional)
Note: any of these steps can fail at any time, so you should be prepared to retry it multiple times.
How much time does it take to send a single HTTP request?
I know you will hate this, but it depends.
There are several factors that might impact the results:
- distance between you and the server
- server load
- internet speed
- path to the server
- size of the request
- size of the response
Results from my laptop
I measured all of my timings by sending HTTP/1.1
requests to subdomains of *.wordpress.com
, my test code was written in JS. (I chose wordpress.com because it has wildcard dns, so I can generate random subdomains to prevent any caching)
HTTP request timings
In my tests I measured each value 50 times. The size of downloaded body was 40kb, and the HTTP request was a simple GET request.
What can we learn from it?
Resolving DNS
record + opening new TLS
connection is really slow - in an example of calling performant website, my server spent ~160ms
before sending a signle byte of HTTP
request.
You probably didn't notice how slow it is, because your browser will open connection once, and reuse it for multiple HTTP
requests. (in case of HTTP/2
or HTTP/3
it is even faster)
The good thing is that most of it is just waiting for another server - we are not wasting a lot of CPU cycles. In my case, I want to send many HTTP/1.1
requests to many different hosts in different networks, so I can't rely on reusing connections.
There is another requirement that I did not mention before - we do not want to overwhelm any target server.
We could theoretically:
- Open
TLS
connection to the server - Send all of the
HTTP/1.1
requests in parallel - Close the connection
- Repeat for another server
But we would get rate-limited or banned very quickly.
So, our only option is to spread the load on many servers.
How can we optimize it?
First, let's think what we can remove from the entire equation. The best part is no part.
The best candidates are Request parsing
and DNS resolution
.
In my case, we do not want to prepare (parse and normalize) the HTTP/1.1
request at all, the whole thing is used for ethical hacking. We are crafting HTTP/1.1
requests by hand, we just want a pipe to send it.
For DNS resolution
, we can resolve all of the DNS
records before we start sending requests. Performant DNS
resolution is different breed, that I didn't want to touch it. I just used massdns
- it can resolve thousands of DNS records in a couple of seconds.
This way we can send requests to many servers without waiting for DNS
resolution.
Designing the HTTP/1.1
sending cannon
I like simplicity, so I used multiple worker pools that were piped together.
- Request generation pool
- Sender pool
- Response handler pool
HTTP request timings
I wanted to reuse as much objects and memory as possible, so I avoided creating new objects
and new goroutines
for each request. Each worker pool was separated by a concurrency safe queue
.
It worked great, each component was separated and could be scaled independently. + It was easy to debug and reason about.
Choosing right HTTP
library
I knew that the net/http
library in Go
is not the fastest one, so I had to look for alternatives. I ended up with fasthttp
library, it is a low-level HTTP
library that is optimized for speed.
According to the benchmarks fasthttp client is up to 10 times faster than net/http
I created n
(number of HTTP Sender
goroutines) instances of fasthttp.Client
objects, and each HTTP Sender
worker was acquiring a client from the pool, sending the request, and releasing it back to the pool.
Optimizing the library
The library was fast, but I wanted to squeeze every bit of performance from it. I notticed that before sending HTTP request, the library was normalizing each HTTP
request. It was a waste of CPU cycles, because I was crafting the HTTP/1.1
requests by hand.
So, I just forked the library and removed the normalization step. It was a simple change, but it saved a lot of CPU cycles.
I also implemented custom SetRequestRaw
method that would allow me to send raw non RFC complaiant HTTP/1.1
requests.
In my version, I could simply do:
1req := rawfasthttp.AcquireRequest()
2resp := rawfasthttp.AcquireResponse()
3
4rawBytes := []byte("GET / HTTP/1.1\r\nHost: example.com\r\n\r\n")
5
6req.SetRequestRaw(rawBytes)
7
8err := client.Do(req, resp)
9
Skipping DNS resolution
For each request I would just override the Dial
function and directly connect to the resolved IP address.
1// single instance of custom dialer
2customDialer = &rawfasthttp.TCPDialer{}
3
4req := rawfasthttp.AcquireRequest()
5resp := rawfasthttp.AcquireResponse()
6
7resolved_ip := "127.0.0.1"
8
9req.SetDial(func(addr string) (net.Conn, error) {
10 return customDialer.Dial(resolved_ip)
11})
12
13// ...
14
TLS handshake optimization
After profiling the code, I found that the TLS handshake
had a big waste of CPU cycles. The TLS
was generating secure cryptographic keys.
Since I am doing hacker things, I do not really care if the keys are secure or not. I just want to send the request as fast as possible.
One simple solution would be to hardcode the keys, but I did not have enough time to implement it. It would require me to do many changes in my fork of fasthttp
library - so I skipped it for now.
Splitting the work into chunks
I had 2.5 million hosts and wanted to send ~200 HTTP
requests to each host. So I needed to chunk it somehow.
The chunks should be small enough to not take more than a couple of minutes to complete, but big enough to not waste time on creating new connections.
I decided to split the work into 200 hosts per chunk. Each chunk would be executed by a single worker.
The goal would be to send all of the HTTP/1.1
requests to each host of those 200 hosts, and then move to the next chunk.
Why 200? It was a number that I found to be optimal for my usecase. The worker pods were prone to fail, so I wanted to minimize the number of requests that would be lost and had to be retried.
Scalling the cannon with Kubernetes
I used Kubernetes on DigitalOcean to scale the cannon. It wasn't my first choice, but since I was sending terabytes of data to internet
it was the only cloud that would not require me to sell my kidney to pay for it.
DigitalOcean generously gives you 2TB+ of bandwidth per droplet.
Custom autoscroller
I wrote a simple JS script that would call Kubernetes API to scale the Deployment up and down based on how many targets I had in queue. It scaled from 0
to 60
pods in a couple of minutes.
DDoS as a Service
I was generating so much traffic that I DDoSed myself - the DigitalOcean's network could not handle my shenanigans.
Each HTTP cannon
pod was pinging queue service
to prove that it is still alive, but since I was generating so much traffic, the pings wouldn't be sent at all, and after a couple of minutes kubelet would kill the pod.
It was another reason to decrease the number of hosts per chunk.
Avoiding detection
Another cool feature of DigitalOcean is that they give you new public IP for each droplet. It was great, because for each scan I would get new IP address.
So I was sending HTTP/1.1
requests from different IP addresses at the same time, and even when Cloudflare
would ban me, the next day I would get new IP address.
Conclusion
The final results were kinda impressive:
- Each pod achieved 100-400 requests per second
- Scaled to 60 pods
- Sent 500 million HTTP/1.1 requests to 2.5 million hosts in just a couple of hours
What was the result? I will tell you in the next post.
🚀 We are hiring
Founding Backend Engineer
Here is what you will do:
- Build state-of-the art code scanning tools (based on LLMs)
- Integrate LLMs and build RAG pipelines
- Build parsers to represent code as graph
- Optimize scanning pipeline to work near realtime
- Scale to millions of snippets of code
Contact me at ZGF3aWRAdmlkb2NzZWN1cml0eS5jb20= (figure it out yourself)