
Linux.com

Feature

Using squidGuard for content filtering

By Keith Winston on March 01, 2007 (8:00:00 AM)


Content filtering for the Web can be a messy proposition. A business may need to block only the most objectionable Web sites, while schools may be required by law to follow a more thorough process. Whatever your needs, you can build a solution with only open source pieces: squid, squidGuard, and blacklists.

The squid server acts as an intermediary between a Web browser and Web server. As a proxy, it receives a URL request from the browser, connects to the server on behalf of the browser, downloads content, then provides it to the browser. It also saves the content to disk so it can provide it more quickly to another browser if the same URL is requested in the near future. Generally, this leads to more efficient utilization of an Internet connection and faster response times for Web browsers.

A typical hardware setup uses two physical network cards on the proxy server. One connects to the internal network, where squid listens for incoming HTTP requests on the default port 3128. The other connects to the Internet, from which it downloads content.

Squid is available for most Linux distributions as a standard package. I was able to get squid running on Red Hat Linux with sane defaults by simply installing the RPM and setting a few options in the /etc/squid/squid.conf configuration file:

visible_hostname your-server-name
acl our_networks src 192.168.0.0/16
http_access allow our_networks
http_access deny all

The visible_hostname tells squid the name of the server. The acl is an access control list used in the http_access rule to allow internal clients to connect to squid. For security reasons, it is important to ensure that users outside your network can't use squid; this is achieved by adding a deny rule near the bottom of your configuration.

Tell the browsers

Most Web browsers behave a little differently when they know they are talking to a proxy server. In Firefox 2.0, you enter proxy settings under Tools -> Options (Firefox -> Preferences on Mac) -> Advanced section -> Network tab, then click the Settings button under Connection.

Firefox proxy

Once the browser is configured, it should make requests and get responses from squid.

Another way to use squid is in transparent proxy mode. Transparent proxies are often used to force Web traffic through the proxy regardless of how each browser is configured. Doing so requires some network trickery to hijack outgoing HTTP requests and also requires additional tweaks to squid. You can read useful guides for configuring squid as a transparent proxy elsewhere.

Redirectors

With no additional configuration, squid faithfully fetches and returns each URL requested of it. To filter the content, squid has a feature called a redirector -- a separate program called by squid that examines the URL and tells squid to either proceed as usual or rewrite the URL so squid returns something else instead. Most often, redirectors rewrite banned URLs, returning the URL of a custom error page that explains why the requested URL was not honored.
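To make the redirector idea concrete, here is a minimal sketch of one in Python. It is not how squidGuard works internally. It assumes the line format used by the squid 2.x series (URL, client address/FQDN, ident, and method separated by spaces, one request per line) and answers with either a replacement URL or a blank line. The blocked domain and the function names are hypothetical; the block page URL reuses the example address from later in this article.

```python
import sys
from urllib.parse import urlsplit

# Hypothetical values, for illustration only.
BLOCKED_DOMAINS = {"spyware.example"}
BLOCK_PAGE = "http://webserver.com/blocked.html"


def rewrite(request_line):
    """Return a replacement URL, or "" to leave the request unchanged."""
    fields = request_line.split()
    if not fields:
        return ""
    # The first field is the requested URL; the rest describe the client.
    host = (urlsplit(fields[0]).hostname or "").lower()
    for domain in BLOCKED_DOMAINS:
        # Match the listed domain itself and any of its subdomains.
        if host == domain or host.endswith("." + domain):
            return BLOCK_PAGE
    return ""


def serve(infile=sys.stdin, outfile=sys.stdout):
    """Answer one line per request until squid closes the pipe."""
    for line in infile:
        outfile.write(rewrite(line) + "\n")
        outfile.flush()  # squid waits on each reply, so do not buffer


# When installed as a helper, the script would end with a call to:
#     serve()
```

A script like this would be named in squid.conf the same way as any other redirector. The explicit flush matters: squid keeps the helper process running and waits for each reply, so buffered output would stall requests.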

Several third-party redirectors have been written, including squirm and squidGuard. Both squirm and squidGuard are C language programs that need to be compiled from source. Squirm operates using regular expression rules, while squidGuard uses a database of domains and URLs to make decisions. I have not done any performance testing on redirectors, but squidGuard has a reputation for scaling well as the size of its blacklist increases. In my experience, squidGuard has performed well on networks with up to a thousand users.

Installing squidGuard 1.2.0

The squidGuard redirector is installed using the familiar "configure, make, make install" routine. One requirement that may not be installed on your system is the Berkeley DB library (now owned by Oracle), which squidGuard uses to store blacklist domains and URLs.

After running make install using the squidGuard source, I discovered that some directories were not created. I manually created the following directories:
/usr/local/squidGuard/ -- for configuration files
/usr/local/squidGuard/log/ -- for log files
/usr/local/squidGuard/db/ -- for blacklist files

Next, copy the sample configuration file to /usr/local/squidGuard/squidGuard.conf. We'll come back to the squidGuard configuration shortly.

To make squid aware of squidGuard, add these options to /etc/squid/squid.conf:

redirect_program /usr/local/bin/squidGuard -c /usr/local/squidGuard/squidGuard.conf
redirect_children 8
redirector_bypass on

The redirect_program option points to the redirector binary and configuration file. The redirect_children option controls how many redirector processes to start. The redirector_bypass option tells squid to ignore the redirector if it becomes unavailable for some reason. If you do not set this option and squidGuard crashes or gets overloaded, squid will quit with a fatal error, perhaps ending all Web access.

Using a blacklist

To be effective as a filter, squidGuard needs a list of domains and URLs that should be blocked. Building and maintaining your own blacklist would require a huge investment in time. Fortunately, you can download a quality list and refresh it as it gets updated. One of the largest and most popular blacklists is maintained by Shalla Security Services.

The Shalla list contains more than one million entries categorized by subject, such as pornography, gambling, and warez. You can use all or any part of the list. The list is free for noncommercial use. For commercial use, a one-page agreement needs to be signed and returned to Shalla, but there is no cost to use the list unless it is embedded and resold in another product. Additional free and non-free blacklists are available, but the Shalla list is a good place to start.

To use it, download and unpack it in a temporary directory. It will create a directory called BL with subject subdirectories below. Copy the directory tree below BL to the /usr/local/squidGuard/db/ directory. When you are done, the db directory should contain the subject subdirectories.

The blacklist itself is a set of plain text files named domains and urls. To allow squidGuard to use them, the text files must be loaded into Berkeley DB format. Before running the conversion process, return to the squidGuard.conf file and define which files you want to use.

The following is a basic squidGuard.conf configuration:

#
# CONFIG FILE FOR SQUIDGUARD
#
dbhome /usr/local/squidGuard/db
logdir /usr/local/squidGuard/log

# DESTINATIONS
dest spy {
        domainlist spyware/domains
        urllist spyware/urls
        log /usr/local/squidGuard/log/blocked.log
}

# ACCESS CONTROL LISTS
acl {
        default {
                pass !spy !in-addr all
                redirect http://webserver.com/blocked.html
        }
}

The dest block defines lists of domains and URLs, used later in the access control section. The example defines a "spy" destination using the spyware blacklist files defined with relative paths to the files in the db directory. It also uses the log option to write records to the blocked.log file when a match is found. The name and location of the log file can be changed.

The acl block defines what squidGuard does with requests passed to it from squid. The example instructs squidGuard to allow all requests that do not match the "spy" destination and are not IP addresses. The redirect option defines what URL to return if a request does not pass. So, if a request matches our blacklist, it gets redirected to the blocked.html page. It is also possible to set up a CGI script that can collect and report additional information, such as the user, source IP, and URL of the request.
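As a sketch of the CGI approach, the redirect line can carry details of the blocked request in the query string. The script path cgi-bin/blocked.cgi is hypothetical, and the parameter names are arbitrary. The % codes are squidGuard's substitution variables as I understand them from its documentation: %a for the client address, %i for the ident user, %t for the matching destination, and %u for the requested URL.

```
acl {
        default {
                pass !spy !in-addr all
                # blocked.cgi is a hypothetical script on your own Web server
                redirect http://webserver.com/cgi-bin/blocked.cgi?clientaddr=%a&clientuser=%i&targetgroup=%t&url=%u
        }
}
```

The CGI script can then log the values or show them to the user on the error page.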

The squidGuard configuration can be arbitrarily complex. I recommend starting out with a simple configuration and slowly adding to it and testing it until it meets your requirements.

Returning to the blacklist, it is time to run the Berkeley DB load process, using squidGuard to create the database files. This command starts the conversion process:

 /usr/local/bin/squidGuard -C all

With this command, squidGuard looks at its configuration file and converts the files defined. In the example, it would only convert the spyware lists, creating the files spyware/domains.db and spyware/urls.db. The loading process can take a while, especially on older hardware.

I ran into an issue with file permissions on the blacklist databases. If the files did not have permissions of 777, squidGuard was not able to use them. Even though the squidGuard processes ran as user squid and the files were owned by user squid with permissions of 755, squidGuard did not work as expected. In my setup, this was not a big problem because squidGuard was running on a standalone firewall. However, on a multi-user system, it would be a serious concern.

Using a whitelist

There are a couple of approaches to setting up a whitelist. One option is to create a whitelist directory under the squidGuard db directory and manage the whitelist using squidGuard ACLs. Another option is to create a file, such as /etc/squid/whitelist, and manage the exceptions with squid. Both options are effective, but I decided to manage the exceptions in squid for two reasons: it would eliminate a call to squidGuard, and it would be faster to modify. If the whitelist were maintained by squidGuard, squid would have to be restarted to make the changes active. With the whitelist maintained by squid, a much faster squid reload (re-reading the configuration file) is all that is required.
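For comparison, the squidGuard-managed approach might look like the following sketch. It assumes a whitelist/domains file created by hand under the db directory, with one domain per line; the destination name "white" is arbitrary. The spy destination from the earlier example still needs to be defined in the same file.

```
# Hypothetical whitelist destination, read from db/whitelist/domains
dest white {
        domainlist whitelist/domains
}

acl {
        default {
                # "white" comes first so that it wins over the blacklist
                pass white !spy !in-addr all
                redirect http://webserver.com/blocked.html
        }
}
```

The pass line is evaluated from left to right, so a whitelisted domain is allowed before the spy destination is consulted.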

To configure the whitelist in squid, two extra options are needed in /etc/squid/squid.conf:

acl white dstdomain "/etc/squid/whitelist"
redirector_access deny white

The first option defines an access control list using the whitelist file. The whitelist file contains domain names (e.g., .youtube.com), one per line. The second option tells squid to skip the call to squidGuard if the URL is in the whitelist. The options must be defined in the order shown; the ACL must be defined before it is used.
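The leading dot in a whitelist entry matters. The following Python sketch illustrates the matching rule, assuming the whitelist is treated as a list of destination domains (squid's dstdomain ACL type): an entry such as .youtube.com covers the domain and all of its subdomains, while an entry without the dot covers only that exact host name. The function names are hypothetical, and this is an illustration of the rule rather than squid's own code.

```python
def load_whitelist(text):
    """Parse whitelist file contents: one domain per line, blanks skipped."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip().lower()
        if line:
            entries.append(line)
    return entries


def is_whitelisted(host, entries):
    """Return True if host matches any whitelist entry."""
    host = host.lower()
    for entry in entries:
        if entry.startswith("."):
            # ".youtube.com" covers youtube.com and every subdomain.
            if host == entry[1:] or host.endswith(entry):
                return True
        elif host == entry:
            # No leading dot: only the exact host name matches.
            return True
    return False
```

With a whitelist containing .youtube.com, both youtube.com and www.youtube.com would bypass squidGuard, but notyoutube.com would not.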

Debugging and tuning

Both squid and squidGuard create useful log files. The primary squid log file is /var/log/squid/cache.log. Squid is very clear when certain problems arise with the redirector. For example, these messages appeared in the squid log during my first full day of production using squidGuard:

WARNING: All redirector processes are busy.
WARNING: 5 pending requests queued
Consider increasing the number of redirector processes in your config file.

The setting in squid.conf for the number of redirectors is redirect_children, so correcting this was straightforward. Other issues may be more subtle.

Squid provides excellent internal diagnostic reports through squidclient, a program included with the squid package. Use the following command on the machine where squid is installed to get general statistics:

squidclient mgr:info

Use this command to see a report on the performance of the redirectors:

squidclient mgr:redirector

When squidGuard has a problem, it may not be as precise. A common error you may see in the squidGuard log is going into emergency mode. There may be additional helpful messages in the log file, but emergency mode usually means that squidGuard has stopped working. Often, there is a syntax error in the configuration file, but it could be a permissions issue or something else. You can test a squidGuard configuration from the command line before committing changes. Simply feed a list of URLs to squidGuard on the command line, using your test configuration file, and see if it returns the expected result. A blank line means squidGuard did not change the URL, while any other result means the URL was rewritten.
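The command-line test can also be scripted. The sketch below is a small Python helper that feeds request lines to any redirector command and collects the replies. It assumes the squid 2.x redirector line format (URL, client address/FQDN, ident, method); the function name and the default client values are hypothetical.

```python
import subprocess


def check_urls(command, urls, client="192.168.1.10/-", ident="-", method="GET"):
    """Feed request lines to a redirector command; return {url: reply}.

    An empty reply means the redirector left the URL unchanged.
    """
    lines = "".join(f"{url} {client} {ident} {method}\n" for url in urls)
    result = subprocess.run(command, input=lines, capture_output=True, text=True)
    replies = result.stdout.splitlines()
    # Pair each URL with the reply on the corresponding output line.
    return {url: reply.strip() for url, reply in zip(urls, replies)}
```

To try a test configuration, the command would be something like ["/usr/local/bin/squidGuard", "-c", "/path/to/test.conf"], with a list of URLs you expect to be blocked and allowed. squidGuard echoes the client fields after a rewritten URL, so treat any non-empty reply as a rewrite.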

The long arm of the squid

Squid and squidGuard offer a reliable, fast platform for Web content filtering. If squidGuard doesn't meet your needs, additional redirectors are available, or you can roll your own. In addition to blacklisting, the redirector interface can be used to remove advertising, replace images, and other creative things. Content filtering with squid can be as coarse or as fine-grained as your needs.


Comments on Using squidGuard for content filtering

Note: Comments are owned by the poster. We are not responsible for their content.

What's New In Squid?

Posted by: Anonymous Coward on March 01, 2007 11:40 PM
It's been a few years since I've used Squid in a production environment, and I wonder how much has changed. When last I used it, Squid was great, but it had one major (to me and my clients) shortcoming when compared to other caching proxies like Microsoft ISA and Novell BorderManager. While squid would filter on IP address quite nicely, filtering on user ID or group was very difficult and impractical because it lacked transparent authentication, whereas ISA and BorderManager did it seamlessly.

So, my question is: has squid changed to allow easy and seamless access control (transparent authentication) by user and group? In commercial offerings it is easy and straightforward to configure a particular user or group's ability to access particular URLs or other addresses and protocols. Can Squid do this, reliably and on multiple platforms (Windows and Linux), yet?

#

Re:What's New In Squid?

Posted by: Anonymous Coward on March 02, 2007 12:42 PM
There is an add-on for squid that lets you authenticate against an NTLM database, either Windows or Samba, and control access by user or group, but it is not transparent. You have to authenticate once for each browser session. It may also work with NIS and regular passwd/group files.

If I remember correctly, ISA uses proprietary headers and only works with IE. Not sure about Border Manager.

#

Re:What's New In Squid?

Posted by: Anonymous Coward on March 03, 2007 12:03 AM
Thanks for the update. I'm sorry to hear that the functionality is still missing.

For the record ISA and BorderManager rely on a client side applet that handles the authentication behind the scenes. They both work with any browser and most any other application too. The applet tells the proxy server who is logged on to the workstation and the proxy server can then allow access based on user, group, IP addresses, protocol, url, time, etc.

#

Re:What's New In Squid?

Posted by: Anonymous Coward on March 07, 2007 05:54 AM
So if you have to actually load a client applet on any workstation using the ISA proxy, why exactly is it described as being transparent?

The problem you encounter in web proxying is that you can either be transparent, i.e. no configuration required, or you can have authentication. One or the other.

The reason is that if the browser doesn't know it's being proxied, it can't pass in the authentication information. I hear that ISA server can do it, but then I hear you have to have a client loaded on the PC. Which means it's NOT transparent at all.

So maybe you mean something different by "transparent"?

#

what about malware filtering?

Posted by: Anonymous Coward on March 23, 2007 03:49 PM
Could anybody recommend a simple and reliable malware filtering solution based on squid and a third-party antivirus? I know what freshmeat is :-). I am just asking for advice from someone who already uses one. Thank you!

#

content filtering

Posted by: Administrator on March 02, 2007 11:40 AM
Squidguard relies on blacklists.

I use dansguardian, which actually inspects each page requested and blocks it if it contains a combination of words that seem inappropriate.

Dansguardian works with blacklists too, and its phrase matching can be adjusted depending on how precious your users are.

It isn't totally free for businesses, though.

www.dansguardian.org

#

Re: content filtering

Posted by: Anonymous [ip: 70.190.31.228] on September 28, 2007 04:43 AM
Where can I find such a blacklist? I have a small filtered search engine which relies solely on text.

#

Using squidGuard for content filtering

Posted by: Anonymous [ip: 125.17.146.2] on August 02, 2007 10:09 AM
Hi. Can anyone suggest how to use MAC addresses in squidGuard? I would like to allow some clients to access the net based on their MAC address.

#

Using squidGuard for content filtering

Posted by: Anonymous [ip: 10.10.53.130] on February 21, 2008 07:52 PM
I've installed squidGuard and BerkeleyDB. When I attempt to start squid, I get a cache.log full of the following errors:
(squidGuard): error while loading shared libraries: libdb-4.6.so: failed to map segment from shared object: Permission denied
This is just before the logs complain about the children dying and squid itself dying.

I've checked the permissions on the BerkeleyDB files and even chown'd them to be owned by squid, but can't seem to fix this.

What am I missing???

THANKS!

#
