Webcrawler Security Analysis

In September 2013, I developed a web crawler for a security analysis of Austrian web servers. This project was done as a seminar project worth 10 ECTS at the Institute for Applied Information Processing and Communications at Graz University of Technology. I decided to publish the resulting code under the GNU GPL v3 licence on github.com, as many projects at the institute depend on it.

The motivation was to check security-related parameters of web servers under Austrian domains (.at). For example, we wanted to know which

Webcrawling Basics

If you want to run some tests on your own, or to extend the analysis or the crawler, you should first learn about webcrawling basics such as seeds, the frontier, and crawling policies (selection policy, re-visit policy, politeness policy, parallelization policy). In my work I ignored the re-visit policy, because I always started crawling from scratch (the database structure does support multiple scans, though).
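To make the selection policy concrete: in Crawler4j a selection policy is implemented by overriding the crawler's shouldVisit method. The following is only a sketch of such a check as a plain predicate, without the Crawler4j dependency; the restriction to .at hosts matches the project's goal, while the exact file-extension blacklist is an illustrative assumption, not my production filter:

```java
import java.net.URI;
import java.util.regex.Pattern;

public class SelectionPolicy {

    // File types we do not want to download (illustrative list, not the project's exact one)
    private static final Pattern BINARY =
            Pattern.compile(".*\\.(css|js|gif|jpe?g|png|zip|pdf)$", Pattern.CASE_INSENSITIVE);

    /** Returns true if the URL should enter the frontier: http(s), .at host, non-binary resource. */
    public static boolean shouldVisit(String url) {
        try {
            URI uri = new URI(url);
            String scheme = uri.getScheme();
            String host = uri.getHost();
            if (scheme == null || host == null) {
                return false;
            }
            if (!scheme.equals("http") && !scheme.equals("https")) {
                return false;
            }
            if (!host.endsWith(".at")) {
                return false; // restrict the crawl to Austrian domains
            }
            String path = uri.getPath() == null ? "" : uri.getPath();
            return !BINARY.matcher(path).matches();
        } catch (Exception e) {
            return false; // malformed URLs never enter the frontier
        }
    }

    public static void main(String[] args) {
        System.out.println(shouldVisit("http://www.tugraz.at/index.html")); // true
        System.out.println(shouldVisit("http://example.com/page"));         // false: not .at
        System.out.println(shouldVisit("http://www.orf.at/logo.png"));      // false: binary
    }
}
```

In the real crawler the same logic sits inside Crawler4j's shouldVisit override, where it decides for every discovered link whether it is handed to the frontier.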

My Selection-Policy was:

And my Politeness-Policy was:

My Webcrawler in general stores the following information to a database for later analysis:

The Analysis Results

I can present some of the information here, but I do not want to publish the details, because they reveal potential attack vectors (e.g. outdated software versions) for specific web sites. For many servers, I found the following information regarding software versions:

XSS headers were also analyzed in detail, and we came to the conclusion that they are rarely used (there are many different XSS headers).

The analysis of embedded scripts showed that many pages use Google Plus buttons, Facebook buttons, the Google Ajax API, the Google Publisher Tag, ... The problem is that it would be easy for Google/Facebook/NSA/others to inject JavaScript code into your website that is executed by the browser and can track all your users. Thank you!

Of the roughly 7400 cookies analyzed, only 827 used httpOnly. Of 2167 cookies from HTTPS web sites, about 1975 were set without the "Secure" flag! Example attack: a single picture loaded over http could leak the session ID.
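If you want to reproduce this kind of check, the standard library already parses Set-Cookie headers: java.net.HttpCookie recognizes both the Secure and the HttpOnly attribute (the latter since Java 7). A minimal sketch, with hypothetical header values:

```java
import java.net.HttpCookie;

public class CookieFlags {

    /** True if the first cookie in this Set-Cookie header value carries the Secure flag. */
    public static boolean isSecure(String setCookieValue) {
        return HttpCookie.parse(setCookieValue).get(0).getSecure();
    }

    /** True if the first cookie in this Set-Cookie header value carries the HttpOnly flag. */
    public static boolean isHttpOnly(String setCookieValue) {
        return HttpCookie.parse(setCookieValue).get(0).isHttpOnly();
    }

    public static void main(String[] args) {
        // A session cookie set without flags: exactly the weakness described above
        String weak = "JSESSIONID=abc123; Path=/";
        // The same cookie hardened with both flags
        String hardened = "JSESSIONID=abc123; Path=/; Secure; HttpOnly";
        System.out.println("weak:     secure=" + isSecure(weak) + " httpOnly=" + isHttpOnly(weak));
        System.out.println("hardened: secure=" + isSecure(hardened) + " httpOnly=" + isHttpOnly(hardened));
    }
}
```

In the crawler these header values come from the HTTP responses of the visited pages; counting the cookies that fail either check gives numbers like the ones above.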

I analyzed the certificates and grouped them by all of their attributes, so you can browse them by property and by value. As an example, the "SigAlgName" property (you see the results of two different crawling sessions separated by a slash):

You can also browse by KeyUsage, PK-Algorithm, Basic Constraints, Extended Key Usage, RSA-Modulus-Bitlength, NonCriticalExtensionOID, ...
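The grouping itself is a simple frequency count over extracted property values. As a sketch: assuming the values have already been pulled from each certificate (e.g. via X509Certificate.getSigAlgName() for the "SigAlgName" property), the browsable view is just a value-to-count map; the sample values below are hypothetical:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CertGrouping {

    /** Counts how often each property value occurs across all certificates. */
    public static Map<String, Integer> countByValue(List<String> values) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String v : values) {
            Integer c = counts.get(v);
            counts.put(v, c == null ? 1 : c + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample; in the real analysis these strings come from
        // X509Certificate.getSigAlgName() for every crawled certificate.
        List<String> sigAlgs = Arrays.asList(
                "SHA256withRSA", "SHA1withRSA", "SHA256withRSA", "MD5withRSA");
        System.out.println(countByValue(sigAlgs));
    }
}
```

The same pattern works for every other property listed above (KeyUsage, PK algorithm, RSA modulus bit length, ...), one map per property.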

I also provide an inverse certificate tree, where you can browse the certificates starting from the root certificates and see in each layer which subjects were issued by which issuer. This is the opposite direction of the certificate chain you may know.
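Conceptually this tree is just the subject-to-issuer relation inverted: instead of following each certificate up to its issuer, you map every issuer down to the subjects it signed. A minimal sketch with hypothetical distinguished names:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class InverseCertTree {

    /** Inverts subject->issuer pairs into issuer->subjects, i.e. the top-down view. */
    public static Map<String, List<String>> invert(Map<String, String> subjectToIssuer) {
        Map<String, List<String>> tree = new HashMap<String, List<String>>();
        for (Map.Entry<String, String> e : subjectToIssuer.entrySet()) {
            List<String> subjects = tree.get(e.getValue());
            if (subjects == null) {
                subjects = new ArrayList<String>();
                tree.put(e.getValue(), subjects);
            }
            subjects.add(e.getKey());
        }
        return tree;
    }

    public static void main(String[] args) {
        // Hypothetical DNs for illustration only
        Map<String, String> subjectToIssuer = new LinkedHashMap<String, String>();
        subjectToIssuer.put("CN=www.example.at", "CN=Some Intermediate CA");
        subjectToIssuer.put("CN=Some Intermediate CA", "CN=Some Root CA");
        System.out.println(invert(subjectToIssuer));
    }
}
```

Starting from the roots (issuers that never appear as subjects of someone else), walking this map layer by layer yields exactly the browsable tree described above.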

Analyzing the validity of the certificates, these are the results (a certificate can be invalid for multiple reasons at the same time). You see the results of two different crawling sessions:

The SSL properties are not very interesting, because the client has to offer a set of algorithms to the server and the server selects one of them, so we do not gain much information about the supported cipher suites. As far as I know, another student group is currently doing this research as an exercise (trying out reduced lists of supported cipher suites). By the way: not a single page used SSL client authentication.
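To see why a single handshake reveals so little: the list the client advertises is fixed by its SSL implementation, and the server only ever picks one entry from it. You can inspect what a default Java client would offer; the follow-up exercise mentioned above then amounts to repeating handshakes with reduced lists (via SSLSocket.setEnabledCipherSuites) to probe what the server actually supports:

```java
import javax.net.ssl.SSLSocketFactory;

public class ClientCipherSuites {

    /** The cipher suites a default JSSE client advertises in its ClientHello. */
    public static String[] offered() {
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
        return factory.getDefaultCipherSuites();
    }

    public static void main(String[] args) {
        // The server will select exactly one of these per connection
        for (String suite : offered()) {
            System.out.println(suite);
        }
    }
}
```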

Technical Information

My crawler is based on Crawler4j, an open source web crawler for Java published under the Apache 2.0 licence. I had to make some minor changes to its source code, hence I had to include that source code in my project. My own code is split into three subprojects: one for the database, one for the analysis and one for the crawler. You will need Java, Maven, a local MySQL database and possibly the Spring Framework. See the README.md file for more information.

You can find the source code here (published under GPL): https://github.com/IAIK/web-crawler-analysis