Intercept HTTP requests with Squid

On one of my projects we had some questions about how much bandwidth was being used by requests to a third-party service, but we didn’t have any view beyond general traffic on the network interface. I hit upon the idea of using a transparent proxy to log requests and then using log analysis to break out data transfer amounts per third-party service. Since we already had squid as part of our infrastructure, it seemed like a good choice.

The tricky part of this setup is that everything is hosted on the same hardware node and we also have some web services that needed to be left untouched. These requirements implied some network configuration using iptables to force outbound web requests through the proxy.

So the first thing I needed to do was install squid. On this project we use CentOS on all our hosts so this was easily accomplished like this:

sudo yum install squid

Next was adjusting the configuration. The default squid.conf comes with lots of documentation, which is helpful but makes the file difficult to read and navigate, so the first thing I did was back up the original and strip out the comment lines like so:

cd /etc/squid && sudo cp squid.conf squid.conf.orig && sudo grep -v '^#' squid.conf > /tmp/squid.conf

This leaves a lot of empty lines in the file which can be removed like this:


sudo sed '/^$/d' /tmp/squid.conf > /tmp/squid.clean && sudo mv /tmp/squid.clean squid.conf
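
The two steps can also be collapsed into a single pass. This is only a minimal sketch; the helper name strip_conf is made up for illustration, and it assumes the backup copy squid.conf.orig from the earlier command exists:

```shell
# strip_conf: read a config file on stdin and print only the active lines,
# dropping comment lines (starting with '#') and empty lines in one pass.
strip_conf() {
    grep -Ev '^(#|$)'
}

# Example usage (run from /etc/squid after backing up squid.conf):
#   strip_conf < squid.conf.orig > /tmp/squid.clean && sudo mv /tmp/squid.clean squid.conf
```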

Next up was setting up networking and squid.

The squid site has a great set of examples, one of which looked like it would suit my purposes nicely.

First I configured squid by adding this directive:

http_port 3128 transparent

Then I started squid:

sudo service squid start

I also wanted to make sure squid starts when the system is rebooted:

sudo chkconfig --levels 2345 squid on

Next up was network configuration.

I needed to set up iptables with some NAT rules to force requests through the proxy server. The first command clears out any existing rules in the nat table. If you already have a custom kernel network config, use this with caution:

sudo iptables -t nat -F

The next rule is the one used in a typical transparent proxy setup: it redirects all inbound port 80 traffic on eth0 to squid. I did not need it in my setup, something I discovered when this command disabled our existing web sites. So if you have a web server on the same host, DO NOT do this:


sudo iptables -t nat -A PREROUTING -p tcp -i eth0 --dport 80 -j REDIRECT --to-port 3128

Here is the start of the iptables configuration we implemented.

Apply the rules to force local HTTP traffic through the transparent proxy:


gid=$(id -g squid)
sudo iptables -t nat -A OUTPUT -p tcp --dport 80 -m owner --gid-owner "$gid" -j ACCEPT
sudo iptables -t nat -A OUTPUT -p tcp --dport 80 -j DNAT --to-destination HOSTIP:3128

Replace the string “HOSTIP” with the IP address of the host you’re configuring.
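
The order of the two OUTPUT rules matters: the ACCEPT rule for squid's own group has to come first, otherwise squid's outbound requests would be sent straight back to itself. The sketch below wraps the same two rules in a function that only prints the commands so they can be reviewed before being run. The function name proxy_rules and the example address 192.0.2.10 are assumptions for illustration, not part of our actual setup:

```shell
# proxy_rules HOSTIP GID: print (do not run) the two NAT rules from above.
proxy_rules() {
    hostip=$1
    gid=$2
    # 1. Let traffic from squid's own group go out untouched (avoids a loop).
    echo "iptables -t nat -A OUTPUT -p tcp --dport 80 -m owner --gid-owner $gid -j ACCEPT"
    # 2. Send all other locally generated port-80 traffic to squid.
    echo "iptables -t nat -A OUTPUT -p tcp --dport 80 -j DNAT --to-destination $hostip:3128"
}

# Example usage: review the output first, then pipe it to a root shell to apply.
#   proxy_rules 192.0.2.10 "$(id -g squid)"
#   proxy_rules 192.0.2.10 "$(id -g squid)" | sudo sh
```

Once the rules are applied, sudo iptables -t nat -L OUTPUT -n -v lists them along with packet counters, which is a quick way to see whether traffic is actually hitting them.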

At this point I needed to test the setup so I tailed the access log. Or at least I tried to. The default permissions on the /var/log/squid directory prevented me from viewing its contents. I fixed that with this:

sudo chmod 0775 /var/log/squid

Then I was able to tail /var/log/squid/access.log, so I made a request to see if it was logged:

wget http://www.google.com

I saw the request logged in the squid access log so I was satisfied that it and the networking were functional. However, the log format wasn’t what we needed to feed to awstats, which is what I was going to use to process the log.

Since we already use squid and process its logs, I grabbed the log settings from our production configuration file:


logformat combined %>a %ui %un [%{%d/%b/%Y:%H:%M:%S %z}tl] "%rm %ru HTTP/%rv" %Hs %<st "%{Referer}>h" "%{User-Agent}>h" %Ss:%Sh
access_log /var/log/squid/access.log combined

Then I restarted squid and tested again. It looked good, so I tested one of our batch processes that makes HTTP requests to make sure that it did what I wanted. It did, but I noticed that query strings from the URI were not being logged. A quick search told me that I needed to add this to squid.conf:

strip_query_terms off

As it turns out, squid strips everything after the “?” from logged URLs by default. This is apparently to “protect privacy”, but we needed the query string to identify individual requests more accurately.

At this point I had the system set up and working. It logged all the outbound HTTP requests and the existing web services remained unaffected. All that was left to do was set up awstats to process the logs.
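
For a quick look at the numbers without awstats, a per-host total can be pulled straight from the log with awk. This is a rough sketch rather than what we ran: it assumes a combined-style log line where the request URL is the 7th whitespace-separated field and the reply size in bytes is the 10th, and the function name bytes_per_host is made up for illustration:

```shell
# bytes_per_host: read combined-format access log lines on stdin and
# print "host total_bytes" for each destination host, sorted by host name.
bytes_per_host() {
    awk '
        {
            # $7 is the URL, e.g. http://www.example.com/path?x=1
            split($7, parts, "/")
            host = parts[3]
            # $10 is the reply size; skip lines where it is not a number
            if (host != "" && $10 ~ /^[0-9]+$/) total[host] += $10
        }
        END { for (h in total) print h, total[h] }
    ' | sort
}

# Example usage:
#   bytes_per_host < /var/log/squid/access.log
```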
