Comparing Pages – Word Length Signatures

During a recent project involving the scanning of web pages for vulnerabilities, I came across a problem – many pages change in insignificant ways during consecutive visits, and it is meaningless to compare them directly.

For example, if I wanted to automate the injection of SQL vectors into the GET parameters of a web page, I would need to compare my result to the original page in order to reach a conclusion. However, if there had been a randomly changing element on the page, I might arrive at a false positive.

Some scanners deal with this by making a few preliminary visits to a page in order to ensure that it doesn’t randomly change. However, if the criteria for “change” is as strict as an absolute match, then the scanner might often miss vulnerable pages – or worse – scan nothing at all.

In order to tackle this problem, I looked into the source code of Skipfish, which is an extremely fast, open source vulnerability scanner written in C++. Their method of comparison is to build a signature out of word length and location on a page, and then compare these signatures with a specified tolerance.

I ported the C++ code to Python, and it worked beautifully. Here is my Python code (sans inline comments) to generate the signature itself:

	"""
	generateSig
		pageContent : The content that was in the page

	Generates and returns the signature for this page, based on the
	response body.
	"""
	def generateSig(self, pageContent):
		# Generate the signature
		code = [0 for x in range(config.fp_size)]
		c_len = 0
		in_space = False

		for i in range(len(payload)):
			if ord(payload[i]) <= 32 or payload[i] == '<' \
			   or payload[i] == '>' or payload[i] == '\'' \
			   or payload[i] == '"':
				if not in_space:
					in_space = True
					if c_len <= config.fp_maxlen:
						code[c_len % config.fp_size] += 1
					c_len = 0
				else:
					c_len += 1
			else:
				if in_space:
					in_space = False
					if c_len <= config.fp_maxlen:
						code[c_len % config.fp_size] += 1
					c_len = 0
				else:
					c_len += 1
					code[c_len % config.fp_size] += 1

		# Return the signature
		return code

Note that there are variables prefixed with config, these are configurable and define parameters relating to the signature. Here fp_size refers to the length of the fingerprint array and fp_maxlen specifies the largest value we can place in any one cell of the array. The variable c_len stores the length of the word we are currently reading.

In order to differentiate between words and set the fingerprint accordingly, we use in_space as a means of switching states. In either state, we can read either a delimiter or a non-delimiting character. Here is how we respond to each of the possible cases:

Read Delimiting Read Non-Delimiting
In Space
  • Increment current word length c_len.
  • Set in_space to false.
  • Increment fingerprint cell at position c_len mod fp_size (if fp_maxlen allows).
  • Set current word length c_len to zero.
Not In Space
  • Set in_space to true.
  • Increment fingerprint cell at position c_len mod fp_size (if fp_maxlen allows).
  • Set current word length c_len to zero.
  • Increment current word length c_len.
  • Increment fingerprint cell at position c_len mod fp_size.

This will give us a more or less unique signature - no two signatures will be the same unless the word lengths and locations, including the delimiters, match exactly. What the words actually are is irrelevant to the signature.

Now we come to the comparison step. We cannot simply compare the arrays using an equality operator, because that gives us zero tolerance and we can only identify pages that match exactly. Instead, we will compare the signature arrays index-by-index, using a relative tolerance value and an absolute tolerance value.

To save time and space, we also have a limit on the amount of dissimilarity we will tolerate during the comparison.

	"""
	compareSigs
		signatureOne: The first page's signature
		signatureTwo: The second page's signature

	Returns a boolean value based on whether the signatures match
	within tolerance (specified in the config). True means that
	page content matches within tolerance, False means it doesn't.
	"""
	def compareSigs(self, signatureOne, signatureTwo):
		# Keeps count of total failure
		bucket_fail = 0
		# Keeps count of total difference
		total_diff = 0
		# Keeps count of total scale
		total_scale = 0

		# Iterate over each value in the signature
		for i in range(config.fp_size):
			# Calculate difference and scale
			diff = signatureOne[i] - signatureTwo[i]
			scale = signatureOne[i] + signatureTwo[i]
			# Check difference and scale against allowed tolerance
			if abs(diff) > \
			      (1 + (scale * config.fp_reltolerance / 100)) \
			      or (abs(diff) > config.fp_abstolerance):
				bucket_fail += 1
				if bucket_fail > config.fp_maxfailure:
					return False
			# Update total count
			total_diff += diff
			total_scale += scale
		# Check against allowed tolerance again
		if abs(total_diff) > \
		      (1 + (total_scale * config.fp_reltolerance / 100)):
			return False
		return True

Here the configurable values are: fp_reltolerance, the relative tolerance; fp_abstolerance, the absolute tolerance; and fp_maxfailure, the amount of failure tolerated before the comparison is halted and False is returned.

I feel that this is a decent way to compare pages to check for changes that are significant, while ignoring minor differences that would otherwise lead to false positives (or negatives).

Keep in mind that since you are comparing webpages, you might also want to compare the headers, which is something I have omitted here.

30th July, 2011

6 Responses to “Comparing Pages – Word Length Signatures”

  1. Romeo Montague
    8:37 pm @ 17th May 2012

    Hi, I just wanted to say 1000 thanks for making that neat addon groove shredder :) I enjoy it a lot, and send you many greetings from sunny Makedonia!

    Peace
    RM

Leave a Reply