Talking about the design and development of vulnerability scanners

I do not want to leave you until the end of the world.
A few days ago I got a copy of "White Hat Talks About Web Scanning" through TSRC, and yesterday I read it again. My overall impression is that the book covers the knowledge needed to design and develop a web vulnerability scanner fairly comprehensively, including some of the pitfalls. However, no single topic is described in much depth or detail. It suits engineers who are learning to design and develop a vulnerability scanner from scratch, and it offers design ideas that help avoid unnecessary pitfalls. Of course, designing and developing a scanner is a very complicated project in its own right. The author cannot cover everything in one book, and some pitfalls you only learn about by falling into them yourself.

This article is my set of reading notes on "White Hat Talks About Web Scanning", and it can also be read as a summary of vulnerability scanner design and development. Before the summary proper, my thanks to the author, Liu Xuan: I learned a great deal from this book. If you repost this article, please be sure to credit the source.

This article touches on many topics. I will start with an outline-style overview and fill in the details over time. Because space is limited, some topics (such as crawler development) get only a brief introduction here; later I will cover each of them in detail in separate articles.

Skipping the preface

Why develop a vulnerability scanner, how the different kinds of scanner (white-box, black-box, gray-box) differ, what each is for, what their advantages and disadvantages are, and so on: 10,000 words omitted here…

Summary of this article

This article focuses on the design and development of two kinds of scanner: 1) a URL-based web vulnerability scanner, and 2) a fingerprint-based vulnerability scanner. Many commercial scanners on the market include both functions, but to make the principles clearer I think it is worth introducing them separately. Note that the scanners described here are all "active scanners", meaning they initiate HTTP requests themselves. A passive scanner, by contrast, works from an HTTP proxy (as with Burp Suite) or a traffic mirror (as with the NSFOCUS, or "Green League", scanner): it does not initiate requests, but captures the content of existing requests for analysis.

How to design a URL-based WEB vulnerability scanner

The title of this section tells us at least two things: first, the scanner's input source is URLs; second, the scanner mainly targets web applications. Developing such a scanner means solving at least two problems:

  • How to collect the input source (that is, collect the site's URLs)
      • Based on traffic cleaning
      • Based on log extraction
      • Based on crawling
  • How to call the scan plugins (that is, scan the URLs)

Get URL data from traffic

Scanners developed in-house (by "Party A", the company that owns the systems) usually involve this part, because extracting URLs from traffic has the least impact on the business and gives the most complete coverage. Commercial scanners built by security vendors ("Party B") generally do not do traffic cleaning, because of difficulties such as deployment.

Collecting traffic

You can mirror traffic from the enterprise's network entry point to a server, then capture it from that server's NIC with a suitable tool. After cleaning, extract data such as url, post_body, and response. There are many tools for capturing traffic, such as justniffer and suricata.

What are the pitfalls of traffic as a scan source?

Mirrored traffic generally contains no usable HTTPS data, because it cannot be decrypted. The traffic also contains user authentication information, and the open question is how to handle it gracefully so that scanning has no effect on real users.

Get request data from the log

Scanners developed in-house generally involve this part as well, because extracting URLs from logs is also a solution with little impact on the business and comprehensive coverage.

Log collection

How to configure nginx to collect logs on the server is not covered here. If you are not familiar with nginx, see the article "nginx load balancing".

What are the pitfalls of logs?

Logs generally do not include post_body or response data. The volume of logs generated each day is very large, and storing that much data would be very expensive, so servers generally record only simple information such as the URL and a timestamp.
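As a rough illustration of log extraction, here is a minimal sketch that pulls the method, URL and status code out of one access log line. It assumes nginx's default "combined" log format; the pattern, the `parse_access_line` name and the sample line are illustrative, and a custom log_format would need a different pattern.

```python
import re
from typing import Optional

# Assumption: nginx's default "combined" log format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)


def parse_access_line(line: str) -> Optional[dict]:
    """Return the fields of one log line, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    return record


# Made-up sample line in combined format.
sample = ('203.0.113.7 - - [16/Mar/2018:13:03:00 +0800] '
          '"GET /item?id=1 HTTP/1.1" 200 512 "-" "curl/7.58.0"')
print(parse_access_line(sample))
```

Note that the request body never appears in such a line, which is exactly the limitation described above.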

Design and develop a crawler

Unlike a general-purpose web crawler, the crawler used in vulnerability scanning crawls all URLs of a single site. To develop a good crawler you need to be familiar with the HTTP protocol. This article does not introduce HTTP; it only summarizes some points to watch in crawler development. If you do not know much about crawlers, see:
Python crawler basics (sorry, not yet written…)
Python-based vulnerability scanning crawler (sorry, not yet written…)

Use HEAD instead of GET to save resources

Not every request can use HEAD, but requests that do not need the response body can use HEAD instead of GET. The only difference between a HEAD request and a GET request is that HEAD returns only the response headers, not the response body.

Some websites have basic anti-crawling measures (such as checking request headers), and some pages require login credentials (cookie authentication), so the crawler also needs to set appropriate request headers, including a cookie.
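A minimal sketch of both points, using only the standard library: pick HEAD when the body is not needed, and attach a User-Agent and an optional cookie. The `build_request` helper, the User-Agent string and the cookie value are assumptions made for illustration.

```python
import urllib.request


def build_request(url: str, need_body: bool, cookie: str = "") -> urllib.request.Request:
    """Build a GET or HEAD request with browser-like headers and an optional cookie."""
    # Illustrative User-Agent; real crawlers usually make this configurable.
    headers = {"User-Agent": "Mozilla/5.0 (compatible; example-crawler)"}
    if cookie:
        headers["Cookie"] = cookie
    method = "GET" if need_body else "HEAD"
    return urllib.request.Request(url, headers=headers, method=method)


# Checking whether a static resource exists does not need the body, so HEAD is enough.
req = build_request("http://example.com/logo.png", need_body=False, cookie="session=test")
print(req.get_method(), req.get_header("Cookie"))
```

The request object can then be passed to `urllib.request.urlopen`; nothing is sent until that call.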

Caching DNS to speed things up

Every time we request a domain name, the IP address for that domain is first looked up from a DNS server. This record usually does not change often, so it can be resolved once at the start of the crawl and cached locally. Subsequent requests read the address from the cache, saving resources.
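One way to sketch this is a small in-memory cache around a resolver function. The `DnsCache` class is hypothetical, and it deliberately ignores record TTLs, on the assumption that a single crawl is short-lived; a long-running scanner would want to expire entries.

```python
import socket
from typing import Callable, Dict


class DnsCache:
    """Resolve each hostname once and reuse the answer for later requests."""

    def __init__(self, resolver: Callable[[str], str] = socket.gethostbyname):
        self._resolver = resolver
        self._cache: Dict[str, str] = {}

    def resolve(self, hostname: str) -> str:
        # Only the first lookup for a hostname reaches the real resolver.
        if hostname not in self._cache:
            self._cache[hostname] = self._resolver(hostname)
        return self._cache[hostname]
```

Usage would look like `cache = DnsCache()` followed by `cache.resolve("example.com")` before each connection. Passing the resolver in as a parameter also makes the cache easy to test without touching the network.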

Extracting new URLs from pages

This involves extracting URLs from different tags, as well as handling dynamic links, static links, the same-origin policy, duplicate URL removal, and more.
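A minimal sketch of the static part of this step: collect URL attributes from a handful of tags, resolve them against the page URL, drop fragments, and keep only links with the same scheme and host as the page. The tag list and the `extract_links` name are illustrative choices, and links generated by JavaScript are not covered.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    # Which attribute carries a URL for each tag (a small, non-exhaustive selection).
    URL_ATTRS = {"a": "href", "link": "href", "script": "src",
                 "img": "src", "iframe": "src", "form": "action"}

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        wanted = self.URL_ATTRS.get(tag)
        base = urlparse(self.base_url)
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative links and strip the #fragment.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                target = urlparse(absolute)
                # Keep only links on the same scheme and host as the page.
                if (target.scheme, target.netloc) == (base.scheme, base.netloc):
                    self.links.add(absolute)


def extract_links(base_url: str, html: str) -> set:
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

Returning a set already removes exact duplicates within a single page.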

Handling page redirects

Page redirects fall into two main groups: server-side and client-side. For details, see: [black hat seo series] page jump. A client-side redirect is visible to the user: the first request returns response code 301 or 302 with the target address in the Location response header, the client then requests that address, and the result is returned. A server-side redirect is invisible to the user: there is only one request, and the jump is handled on the server.
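For the client-side case, a crawler that handles redirects itself needs to turn a response into the next URL to fetch. The sketch below assumes the response is available as a status code plus a header mapping; `redirect_target` is a hypothetical helper, and it also accepts 303, 307 and 308 in addition to the 301 and 302 mentioned above.

```python
from typing import Mapping, Optional
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}


def redirect_target(current_url: str, status: int,
                    headers: Mapping[str, str]) -> Optional[str]:
    """Return the absolute redirect target, or None if this is not a redirect."""
    if status not in REDIRECT_CODES:
        return None
    # Header names are case-insensitive, so normalize before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    location = lowered.get("location")
    if not location:
        return None
    # Location may be relative, so resolve it against the current URL.
    return urljoin(current_url, location)
```

A real crawler would also cap the number of hops to avoid redirect loops.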

Identifying 404 pages

Normally, when a page does not exist the response code is 404. However, some websites try to be user friendly: when a nonexistent page is requested, they redirect to an existing page (response code 302 or 301), or serve the home page content directly (response code 200), or show a 404 notice page (response code 200). In these more complicated cases the response code alone is clearly not enough; it has to be combined with the page content.
Solution: first request some pages that certainly do not exist and record the page content as C1. Then request the target page. If the response code is 404, the page does not exist. If it is not 404, compare the target page's source C2 with C1; if they are similar, the page does not exist.
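The decision rule above can be sketched as follows. `difflib.SequenceMatcher` is used here only as a convenient stand-in for the similarity measure, and the 0.9 threshold is an arbitrary assumption that would need tuning per site.

```python
from difflib import SequenceMatcher


def is_missing_page(status: int, body: str, baseline_body: str,
                    threshold: float = 0.9) -> bool:
    """Decide whether a response is a "not found" page.

    baseline_body is C1: the content returned for a URL known not to exist.
    body is C2: the content of the page being checked.
    """
    if status == 404:
        return True
    # autojunk=False keeps the comparison exact for long pages.
    similarity = SequenceMatcher(None, body, baseline_body, autojunk=False).ratio()
    return similarity >= threshold
```

In practice C1 is usually built from several random nonexistent paths, since some sites echo the requested path back into the page.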

Handling duplicate URLs

Remove duplicate URLs so that the same page is not crawled twice; this also applies to URLs that are merely similar. One solution is to keep the hash of each URL in memory, for example in a Python set (a list also works, but lookups are slower), and check whether the hash of each new URL is already there. If it is not, the URL is queued for crawling.
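A minimal sketch of this idea. To also catch "similar" URLs, the key here is built from the path plus the sorted parameter names, ignoring parameter values. That normalization rule is one possible assumption rather than the only reasonable one, and `url_key` and `UrlDeduplicator` are hypothetical names.

```python
import hashlib
from urllib.parse import parse_qsl, urlparse


def url_key(url: str) -> str:
    """Hash a URL so that URLs differing only in parameter values collide."""
    parts = urlparse(url)
    names = sorted({name for name, _ in parse_qsl(parts.query, keep_blank_values=True)})
    normalized = f"{parts.scheme}://{parts.netloc.lower()}{parts.path}?{'&'.join(names)}"
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class UrlDeduplicator:
    def __init__(self):
        self._seen = set()

    def add_if_new(self, url: str) -> bool:
        """Return True (and remember the URL) only the first time its key is seen."""
        key = url_key(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

With this rule, `/item?id=1` and `/item?id=2` count as the same page, so only one of them is queued for crawling.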

Calculating page similarity

Page similarity can be measured with a Hamming distance calculation, which is very helpful for identifying 404 pages.
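Hamming distance compares two fixed-length bit strings, so each page first has to be reduced to a fingerprint. The text does not say which fingerprint to use; the sketch below assumes a simple SimHash over word tokens, which is a common pairing, and the 64-bit size is likewise an assumption.

```python
import hashlib
import re

BITS = 64  # fingerprint size (assumption)


def simhash(text: str) -> int:
    """Compute a 64-bit SimHash fingerprint from the word tokens of a page."""
    weights = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[:BITS // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(BITS):
            weights[i] += 1 if (value >> i) & 1 else -1
    return sum(1 << i for i in range(BITS) if weights[i] > 0)


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two pages are then treated as similar when the distance between their fingerprints falls below some small threshold, which has to be chosen empirically.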

Retrying dropped requests

Sometimes a request is dropped because of network delay. In that case, retry it until the number of attempts reaches the retry threshold.
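A minimal retry wrapper might look like the following. The `with_retries` name, the default of three retries and the fixed delay are assumptions; it treats any `OSError` (which covers connection resets, timeouts and `urllib.error.URLError`) as retryable.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(action: Callable[[], T], max_retries: int = 3, delay: float = 0.5) -> T:
    """Run action, retrying on network errors until the retry threshold is reached."""
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            return action()
        except OSError as error:
            last_error = error
            # Wait before the next attempt, but not after the final one.
            if attempt < max_retries:
                time.sleep(delay)
    raise last_error
```

A call such as `with_retries(lambda: urllib.request.urlopen(req, timeout=5))` would wrap a single fetch.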

Parsing page forms

Parsing events and AJAX requests

Many web pages now send requests via AJAX, so the crawler must be able to discover AJAX requests as well. It also needs to trigger the events on the page.

Web 2.0 crawler

The biggest difference between Web 2.0 and Web 1.0 is the large amount of dynamic content: much of a page's content is generated dynamically by JavaScript. The crawler therefore needs to be able to execute JavaScript. Tools worth considering include PhantomJS and headless Chrome.

Maintaining the vulnerability library

Put simply, a vulnerability scanner has two main functions: input and scanning. Having an input source alone is not enough. The scanner must also be able to scan, and its scanning capability depends mainly on the scan plugins accumulated over time.

How to gracefully replay a request for a URL

From traffic we can obtain URLs together with their POST data (including authentication information). Since we are designing an active scanner that has to initiate requests itself, replaying these requests gracefully is a problem. Cookies on some websites stay valid for a long time, so replaying requests with real users' cookies is bound to affect the business, while without cookies many pages are inaccessible.
One solution is to replace the captured cookie with the cookie of a test account, so that real users are not affected, but this approach has many pitfalls of its own.
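The cookie swap can be sketched as a small header rewrite applied before each replay. `prepare_replay_headers` is a hypothetical helper; dropping the Authorization header as well is an extra precaution assumed here, not something the text prescribes.

```python
from typing import Dict

# Headers that carry a real user's identity and must not be replayed as-is.
SENSITIVE_HEADERS = {"cookie", "authorization"}


def prepare_replay_headers(captured: Dict[str, str], test_cookie: str) -> Dict[str, str]:
    """Copy captured headers, replacing user credentials with a test account's cookie."""
    headers = {name: value for name, value in captured.items()
               if name.lower() not in SENSITIVE_HEADERS}
    headers["Cookie"] = test_cookie
    return headers
```

One of the pitfalls shows up immediately: a test account may not have permission to reach the same pages as the user whose request was captured.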

How to design a fingerprint-based vulnerability scanner

The title of this section again tells us at least two things: first, the scanner's input source is service fingerprints; second, the scanner targets web applications plus services.
Developing such a scanner requires at least two steps:

  • Collect the input source (that is, capture system fingerprints)
      • Port scanning
      • Fingerprint scanning
      • Fingerprint matching
  • Call the scan plugins (that is, scan for vulnerabilities that match the fingerprint)

Developing a port scanner

You can use Python's socket module to develop a TCP scanner, or use open source tools such as nmap, masscan, and zmap.
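A minimal TCP connect scan with the socket module could look like this. It is sequential and IPv4-only, and the one-second timeout is an arbitrary assumption; the dedicated tools listed above are far faster. Only scan hosts you are authorized to test.

```python
import socket
from typing import Iterable, List


def scan_ports(host: str, ports: Iterable[int], timeout: float = 1.0) -> List[int]:
    """Return the ports on host that accept a TCP connection."""
    open_ports = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            # connect_ex returns 0 on success and an error code otherwise.
            if sock.connect_ex((host, port)) == 0:
                open_ports.append(port)
    return open_ports
```

For example, `scan_ports("127.0.0.1", [22, 80, 443])` returns whichever of those ports are listening locally.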

Developing a fingerprint scanner

You can scan with nmap, which ships with a large number of fingerprint probes and can identify most service fingerprints. For web fingerprints, you need to send an HTTP request, take the response content, and identify it against a web fingerprint library, or use an open source fingerprint scanner such as WhatWeb.
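The matching step for web fingerprints can be sketched as a list of rules applied to the response headers and body. The rule format and the three rules below are hypothetical and heavily simplified; they are not the format used by WhatWeb or nmap.

```python
import re
from typing import Dict, List

# Hypothetical mini fingerprint library: each rule checks one header or the body.
FINGERPRINT_RULES = [
    {"name": "nginx", "header": "server", "pattern": r"nginx"},
    {"name": "PHP", "header": "x-powered-by", "pattern": r"php"},
    {"name": "WordPress", "body": r"/wp-content/"},
]


def identify(headers: Dict[str, str], body: str) -> List[str]:
    """Return the names of all rules that match the given HTTP response."""
    lowered = {name.lower(): value for name, value in headers.items()}
    hits = []
    for rule in FINGERPRINT_RULES:
        if "header" in rule:
            value = lowered.get(rule["header"], "")
            if re.search(rule["pattern"], value, re.IGNORECASE):
                hits.append(rule["name"])
        elif re.search(rule["body"], body, re.IGNORECASE):
            hits.append(rule["name"])
    return hits
```

The returned names are what the scanner would then look up to decide which vulnerability plugins to run.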

Maintaining the fingerprint library

Fingerprints alone, without a fingerprint library, are not enough. A fingerprint is like a piece of identity information: what we ultimately want is to pin it to a specific person. So we need a fingerprint database that associates fingerprint information with the "person" it belongs to.

How to design the pipeline: Input Source -> Queue -> Task Distribution -> Scan Node -> Storage

The sections above briefly introduced the problems that have to be solved when designing and developing the two kinds of scanner, but from the point of view of the whole system they are far from the full list. For example, when the systems to be scanned are very large, the scanning framework has to support distributed deployment.

Recommended technology stack: Python + RabbitMQ + Celery + MySQL
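To make the flow concrete without depending on external services, here is a single-process sketch in which a standard-library queue and worker threads stand in for RabbitMQ and Celery, and an in-memory list stands in for MySQL. It only shows the shape of input source -> queue -> task distribution -> scan node -> storage; `run_pipeline` and its `scan` callback are hypothetical, and a real deployment would spread the workers across machines.

```python
import queue
import threading
from typing import Callable, Iterable, List, Tuple


def run_pipeline(urls: Iterable[str], scan: Callable[[str], str],
                 worker_count: int = 4) -> List[Tuple[str, str]]:
    """Push URLs through a queue to worker threads and collect their results."""
    tasks: queue.Queue = queue.Queue()   # stand-in for the message queue
    results: List[Tuple[str, str]] = []  # stand-in for the database
    lock = threading.Lock()

    def worker() -> None:                # stand-in for a scan node
        while True:
            url = tasks.get()
            if url is None:              # sentinel: no more work
                break
            finding = scan(url)
            with lock:
                results.append((url, finding))

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for url in urls:                     # input source feeding the queue
        tasks.put(url)
    for _ in threads:
        tasks.put(None)
    for thread in threads:
        thread.join()
    return results
```

Because workers finish in no fixed order, the results come back unordered.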

Did you think this was the end? No, no. I am a little tired today, and I will keep adding to this over the next few days, for example some basic code and more detailed content. There is also a lot of material on the Internet that I still need to sort through and learn from.

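Article title: Talking about the design and development of vulnerability scanners


Published: March 16, 2018 - 13:03

Last updated: August 16, 2019 - 15:08


License: Attribution-NonCommercial-NoDerivatives 4.0 International. When reposting, please keep the original link and the author.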

nmask wechat