During a recent project involving the scanning of web pages for vulnerabilities, I came across a problem – many pages change in insignificant ways during consecutive visits, and it is meaningless to compare them directly.
For example, if I wanted to automate the injection of SQL vectors into the GET parameters of a web page, I would need to compare my result to the original page in order to reach a conclusion. However, if there had been a randomly changing element on the page, I might arrive at a false positive.
Some scanners deal with this by making a few preliminary visits to a page in order to ensure that it doesn’t randomly change. However, if the criteria for “change” is as strict as an absolute match, then the scanner might often miss vulnerable pages – or worse – scan nothing at all.
In order to tackle this problem, I looked into the source code of Skipfish, which is an extremely fast, open source vulnerability scanner written in C++. Their method of comparison is to build a signature out of word length and location on a page, and then compare these signatures with a specified tolerance.
I ported the C++ code to Python, and it worked beautifully. Here is my Python code (sans inline comments) to generate the signature itself:
""" generateSig pageContent : The content that was in the page Generates and returns the signature for this page, based on the response body. """ def generateSig(self, pageContent): # Generate the signature code = [0 for x in range(config.fp_size)] c_len = 0 in_space = False for i in range(len(payload)): if ord(payload[i]) <= 32 or payload[i] == '<' \ or payload[i] == '>' or payload[i] == '\'' \ or payload[i] == '"': if not in_space: in_space = True if c_len <= config.fp_maxlen: code[c_len % config.fp_size] += 1 c_len = 0 else: c_len += 1 else: if in_space: in_space = False if c_len <= config.fp_maxlen: code[c_len % config.fp_size] += 1 c_len = 0 else: c_len += 1 code[c_len % config.fp_size] += 1 # Return the signature return code
Note that there are variables prefixed with config, these are configurable and define parameters relating to the signature. Here fp_size refers to the length of the fingerprint array and fp_maxlen specifies the largest value we can place in any one cell of the array. The variable c_len stores the length of the word we are currently reading.
In order to differentiate between words and set the fingerprint accordingly, we use in_space as a means of switching states. In either state, we can read either a delimiter or a non-delimiting character. Here is how we respond to each of the possible cases:
| Read Delimiting | Read Non-Delimiting | |
|---|---|---|
| In Space |
|
|
| Not In Space |
|
|
This will give us a more or less unique signature - no two signatures will be the same unless the word lengths and locations, including the delimiters, match exactly. What the words actually are is irrelevant to the signature.
Now we come to the comparison step. We cannot simply compare the arrays using an equality operator, because that gives us zero tolerance and we can only identify pages that match exactly. Instead, we will compare the signature arrays index-by-index, using a relative tolerance value and an absolute tolerance value.
To save time and space, we also have a limit on the amount of dissimilarity we will tolerate during the comparison.
""" compareSigs signatureOne: The first page's signature signatureTwo: The second page's signature Returns a boolean value based on whether the signatures match within tolerance (specified in the config). True means that page content matches within tolerance, False means it doesn't. """ def compareSigs(self, signatureOne, signatureTwo): # Keeps count of total failure bucket_fail = 0 # Keeps count of total difference total_diff = 0 # Keeps count of total scale total_scale = 0 # Iterate over each value in the signature for i in range(config.fp_size): # Calculate difference and scale diff = signatureOne[i] - signatureTwo[i] scale = signatureOne[i] + signatureTwo[i] # Check difference and scale against allowed tolerance if abs(diff) > \ (1 + (scale * config.fp_reltolerance / 100)) \ or (abs(diff) > config.fp_abstolerance): bucket_fail += 1 if bucket_fail > config.fp_maxfailure: return False # Update total count total_diff += diff total_scale += scale # Check against allowed tolerance again if abs(total_diff) > \ (1 + (total_scale * config.fp_reltolerance / 100)): return False return True
Here the configurable values are: fp_reltolerance, the relative tolerance; fp_abstolerance, the absolute tolerance; and fp_maxfailure, the amount of failure tolerated before the comparison is halted and False is returned.
I feel that this is a decent way to compare pages to check for changes that are significant, while ignoring minor differences that would otherwise lead to false positives (or negatives).
Keep in mind that since you are comparing webpages, you might also want to compare the headers, which is something I have omitted here.
Hi, I just wanted to say 1000 thanks for making that neat addon groove shredder
I enjoy it a lot, and send you many greetings from sunny Makedonia!
Peace
RM