Webcrawler Security Analysis
In September 2013, I developed a web crawler for a security analysis of Austrian web servers. This project was done as a "Seminar Project" of 10 ECTS at the Institute for Applied Information Processing and Communications at Graz University of Technology. I decided to publish the resulting code under the GNU GPL v3 license on github.com, as many projects at the institute depend on it.
The motivation was to check security-related parameters of web servers under Austrian domains (.at). For example, we wanted to know:
- Which XSS headers are used
- Whether cookie usage is correct (e.g. Secure, HttpOnly)
- Which software versions are used
- Which external scripts are loaded (Facebook, Google+, ...)
- Which SSL properties are used
- and we wanted to analyze certificates:
  - Which are valid?
  - Why are they invalid?
  - Which hash functions are used?
  - Which signature methods?
  - Which public keys?
  - What about the key usage?
Webcrawling Basics
If you want to run some tests on your own, or to extend the analysis or the crawler, you should first learn about web-crawling basics such as Seeds, the Frontier, and Crawling Policies (Selection Policy, Re-Visit Policy, Politeness Policy, Parallelization Policy). In my work I ignored the Re-Visit Policy, because I always started crawling from the beginning (the database structure supports multiple scans).
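As a minimal sketch of the frontier concept (class and method names are mine, not the project's): seeds are scheduled first, every URL is visited at most once, and newly discovered links are appended in FIFO order.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

// Minimal crawl frontier: seeds go in first, every URL is scheduled
// at most once, discovered links are appended (FIFO order).
public class Frontier {
    private final Queue<String> queue = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();

    public Frontier(List<String> seeds) {
        seeds.forEach(this::schedule);
    }

    // Add a URL unless it was already scheduled before.
    public boolean schedule(String url) {
        if (seen.add(url)) {
            queue.add(url);
            return true;
        }
        return false;
    }

    // Next URL to crawl, or null when the frontier is exhausted.
    public String next() {
        return queue.poll();
    }
}
```

A real crawler such as Crawler4j persists its frontier and runs many worker threads against it; this sketch only illustrates the data structure.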
My Selection-Policy was:
- Start with pages given as seeds
- If page is on a whitelist: crawl always
- Only visit domains that end with ".at"
- Visit each (full) domain only once
- If the website uses http, check whether https is available as well
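The selection rules above could be sketched like this (a hedged illustration, not the project's actual code; I assume here that whitelisted hosts bypass the other rules):

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

// Sketch of the selection policy: whitelisted hosts are always crawled,
// everything else must be a .at domain, and each full domain is visited
// only once.
public class SelectionPolicy {
    private final Set<String> whitelist;
    private final Set<String> visitedDomains = new HashSet<>();

    public SelectionPolicy(Set<String> whitelist) {
        this.whitelist = whitelist;
    }

    public boolean shouldVisit(String url) {
        String host = URI.create(url).getHost();
        if (host == null) return false;
        if (whitelist.contains(host)) return true;  // whitelist: crawl always
        if (!host.endsWith(".at")) return false;    // only Austrian domains
        return visitedDomains.add(host);            // each full domain only once
    }
}
```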
And my Politeness-Policy was:
- Do not visit the same domain more often than once every ... seconds
- Respect the robots.txt and ignore the URL if access for crawlers is forbidden
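The robots.txt part of the politeness policy can be sketched as follows. This is a deliberately small illustration, not a complete robots.txt parser: it only collects the Disallow prefixes of the "User-agent: *" group.

```java
import java.util.ArrayList;
import java.util.List;

// Tiny robots.txt check (a sketch, not a full parser): collect the
// Disallow prefixes of the "User-agent: *" group and reject any path
// that starts with one of them.
public class RobotsTxt {
    private final List<String> disallowed = new ArrayList<>();

    public RobotsTxt(String robotsTxt) {
        boolean inStarGroup = false;
        for (String raw : robotsTxt.split("\n")) {
            String line = raw.trim();
            if (line.toLowerCase().startsWith("user-agent:")) {
                inStarGroup = line.substring(11).trim().equals("*");
            } else if (inStarGroup && line.toLowerCase().startsWith("disallow:")) {
                String path = line.substring(9).trim();
                if (!path.isEmpty()) disallowed.add(path);
            }
        }
    }

    public boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }
}
```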
My Webcrawler in general stores the following information to a database for later analysis:
- All Headers except those on a Blacklist
- Detected Scripts (Facebook Like Buttons, G+, External Java Script Libraries,...)
- SSL Parameters (used Methods)
- Certificate (encoded and additionally extracted some information to DB-fields for easy lookups)
- The Html-Meta-Tag "Generator" to find out about the running software
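Extracting the "Generator" meta tag could look like this (a regex-based sketch with names of my own choosing; the real crawler may parse the HTML differently):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: extract the "generator" meta tag that reveals the CMS,
// e.g. <meta name="generator" content="WordPress 3.5.1">.
public class GeneratorTag {
    private static final Pattern META = Pattern.compile(
        "<meta\\s+name=[\"']generator[\"']\\s+content=[\"']([^\"']*)[\"']",
        Pattern.CASE_INSENSITIVE);

    // Returns the content attribute, or null if no generator tag is found.
    public static String extract(String html) {
        Matcher m = META.matcher(html);
        return m.find() ? m.group(1) : null;
    }
}
```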
The Analysis Results
I can present some information to you, but I do not want to publish the details, because they reveal potential attack vectors (e.g. old software versions) for specific web sites. For many servers, I found the following information regarding software versions:
- Operating System (e.g. "Ubuntu 4.14") in the "Powered-By"-Header and in the "Server"-Header
- PHP Version (e.g. 5.2.10) in the "Powered-By"-Header
- Asp.Net-Version (e.g. 4.0.30319) in the "X-AspNet-Version"-Header
- Server Software and Version (e.g. "Apache-Coyote/1.1") in the "Server"-Header
- Joomla/WordPress/TYPO3 versions in the Html-Meta-Tag "Generator"
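Many of these headers share the "product/version" pattern, so a single regex is enough to pull out the leaked version strings. This is an illustrative sketch (class and method names are mine):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: pull "product/version" tokens out of header values such as
// "Server: Apache-Coyote/1.1" or "X-Powered-By: PHP/5.2.10".
public class VersionLeak {
    private static final Pattern PRODUCT = Pattern.compile("([A-Za-z.-]+)/([0-9][0-9.]*)");

    // Returns "product version" for the first token found, or null.
    public static String firstProduct(String headerValue) {
        Matcher m = PRODUCT.matcher(headerValue);
        return m.find() ? m.group(1) + " " + m.group(2) : null;
    }
}
```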
XSS-Headers were also analyzed in detail, and we came to the conclusion that they are rarely used (there are many different XSS-Headers).
The analysis of the used scripts showed that many pages use Google+ buttons, Facebook buttons, the Google Ajax API, the Google Publisher Tag, ... The problem is that it would be easy for Google/Facebook/NSA/others to inject JavaScript code into your website that is executed by the browser and can track all your users. Thank you!
Of about 7400 cookies analyzed, only 827 used HttpOnly. Of 2167 cookies from HTTPS web sites, about 1975 were set without the "Secure" flag! Example attack: a single picture loaded over HTTP could leak the session ID.
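The check itself is simple: cookie attributes in a Set-Cookie header are separated by semicolons and their names are case-insensitive. A minimal sketch (names are mine, not the project's):

```java
import java.util.Arrays;

// Sketch of the cookie check: inspect a Set-Cookie header value for
// the Secure and HttpOnly attributes (names are case-insensitive).
public class CookieFlags {
    public static boolean hasAttribute(String setCookieHeader, String attribute) {
        return Arrays.stream(setCookieHeader.split(";"))
                     .map(String::trim)
                     .anyMatch(a -> a.equalsIgnoreCase(attribute));
    }
}
```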
I analyzed the certificates and grouped them by all of their attributes, so you can browse them by property and by value. As an example, the "SigAlgName" property (you see the results of two different crawling sessions separated by a slash):
- SHA256withRSA (22 / 36)
- MD5withRSA (79 / 232)
- SHA1withRSA (2366 / 4895)
- SHA512withRSA (3 / 4)
You can also browse by KeyUsage, PK-Algorithm, Basic Constraints, Extended Key Usage, RSA-Modulus-Bitlength, NonCriticalExtensionOID, ...
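The grouping step boils down to a histogram over one attribute per session. A sketch (the attribute values would come from `java.security.cert.X509Certificate`, e.g. `getSigAlgName()`; the class name here is my own):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the grouping step: count how often each attribute value
// (e.g. a certificate's SigAlgName) occurs in one crawling session.
public class AttributeHistogram {
    public static Map<String, Integer> count(List<String> values) {
        Map<String, Integer> histogram = new HashMap<>();
        for (String v : values) {
            histogram.merge(v, 1, Integer::sum);  // increment, starting at 1
        }
        return histogram;
    }
}
```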
I also provide an Inverse Certificate Tree, where you can browse the certificates starting from the root certificates and see in each layer which subjects were issued by which issuer. This is the opposite direction of the certificate chain you may know.
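Conceptually, inverting the chain means grouping every certificate's subject under its issuer, so the tree can be walked top-down from the roots. A minimal sketch under that assumption:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the inverse tree: instead of following a chain from leaf
// to root, group each subject under its issuer so the tree can be
// browsed top-down starting at the root certificates.
public class InverseCertTree {
    public static Map<String, List<String>> build(Map<String, String> subjectToIssuer) {
        Map<String, List<String>> issuerToSubjects = new TreeMap<>();
        subjectToIssuer.forEach((subject, issuer) ->
            issuerToSubjects.computeIfAbsent(issuer, k -> new ArrayList<>()).add(subject));
        return issuerToSubjects;
    }
}
```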
Analyzing the validity of the certificates, these are the results (certificates can be invalid because of multiple issues at the same time). You see the results of two different crawling sessions:
- Valid Certificates: 1269 / 2231
- Invalid Hostnames: 6496 / 18720
- Expired Certificates: 1497 / 4276
- Not yet valid Certificates: 0 / 0
- No Trust Anchor: 2675 / 7787
- Additional Exceptions: 341 / 1347
The SSL properties are not very interesting, because the client has to offer a list of algorithms to the server and the server selects one of them, so we do not gain much information about the supported cipher suites. As far as I know, another student group is currently doing this research as an exercise (trying out reduced lists of supported cipher suites). By the way: not a single page used SSL client authentication.
Technical Information
My crawler is based on Crawler4j, an open-source web crawler for Java published under the Apache 2.0 license. I had to make some minor changes to its source code, hence I had to include the source code in my project. My own code is split into three subprojects: one for the database, one for the analysis and one for the crawler. You will need Java, Maven, a local MySQL database and possibly the Spring Framework. See the README.md file for more information.
You can find the source code here (published under GPL): https://github.com/IAIK/web-crawler-analysis