Catching Bad Guys using Web Server Logs & HTTP 404 Errors!

What is an HTTP 404 Error?

Hypertext Transfer Protocol (HTTP) is the protocol that carries webpage data to your web browser, which then renders it on your computer screen.

On the OSI model, HTTP sits on the Application Layer (Layer 7).

HTTP also works on the client server model. The client, your web browser, requests a file from a web server via a URL. This page has a URL that brought you here.

If logging is enabled on your web server, you will have a record of all HTTP requests to review. Try here if you need help configuring logging on IIS.

If a client’s HTTP request is successful, the server returns the data along with a “200” status code, which means OK.

If a client’s HTTP request fails because the requested file is missing, the server sends back an HTTP 404 error, meaning “Not Found”.

Hackers, Recon and the HTTP 404 Error!

404 NOT FOUND entries in your web server logs are often the earliest signs of surveillance, footprinting or reconnaissance.

Unless your attacker has inside knowledge of the target, any recon attempt using HTTP will almost certainly generate some 404 errors. They can’t avoid it, as it is a byproduct of the enumeration process: guessing at file and folder names means most guesses will miss.
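Because enumeration produces a cluster of misses from one source, a simple count of 404s per client IP is often enough to surface it. Here is a minimal Python sketch of that idea. The function name, the threshold of 3 and the sample requests are my own assumptions for illustration, not values from a real log.

```python
from collections import Counter

def flag_noisy_clients(requests, threshold=3):
    """Return {ip: count} for clients with at least `threshold` 404 responses.

    `requests` is an iterable of (client_ip, status_code) pairs.
    """
    counts = Counter(ip for ip, status in requests if status == 404)
    return {ip: n for ip, n in counts.items() if n >= threshold}

# Made-up sample data using documentation-only IP ranges.
SAMPLE_REQUESTS = [
    ("203.0.113.7", 404),
    ("203.0.113.7", 404),
    ("203.0.113.7", 200),
    ("203.0.113.7", 404),
    ("198.51.100.20", 404),  # a single miss, likely a stale bookmark
    ("198.51.100.20", 200),
]

print(flag_noisy_clients(SAMPLE_REQUESTS))
```

The right threshold depends on your traffic. A busy site with lots of broken inbound links will need a higher number than a small internal application.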

Video: Enumeration by HackerSploit

Early Recon Detection

Early recon detection along with early blocking actions can be a game changer in the never-ending game of digital tag, where you really don’t want to be “it”.

If you’re doing this for a living and take the topic more seriously, I would use the word countersurveillance to describe what we need. Who is coming at us, where are they coming from and how can we mitigate risk?

Log Files & Free Analysis Tools

For this example I’m going to focus on Microsoft IIS web server logs as that is what I have handy.

Notice the naming of the log files in the screenshot below. They contain the start date of each log. Example: u_ex220916.log.

In this example, the logs are created, one per day. The server can be configured to make the logs run for a week or even for a month. I prefer smaller files.

NotePad++ for Reviewing Log Files

Most of the time I just use Notepad++ to review log files on a daily or weekly basis. Notepad++ does a great job of searching all open files.

All I do is run a search across all open files for “ 404 ”. Make sure to leave a space character on each side so the search matches the status code field and not other numbers that happen to contain 404.

Open multiple files at one time using Notepad++

CTRL+F to Find our 404’s

With all of our files open in Notepad++, we’ll use CTRL+F to open the Find dialog window. Type in “ 404 ”

Actual Recon 404 Errors from IIS Log

Actual IIS Log with HTTP 404 errors showing recon event identified

Where are the 404s?

Most of the log entries are very long and difficult to display online.

This 2nd screenshot shows where the actual 404 error is on each line in the server log. It is towards the end of each line.
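If you would rather not hunt for the status code by eye, the same search can be sketched in Python. IIS W3C logs carry a “#Fields:” header line that names each space-separated column, and the status code lives in the sc-status column. The function names and the shortened sample log below are assumptions for illustration. A real IIS log has many more fields per line.

```python
def parse_w3c_lines(lines):
    """Yield one dict per log entry, keyed by the names in the #Fields header."""
    fields = []
    for raw in lines:
        line = raw.strip()
        if line.startswith("#Fields:"):
            # The header tells us which column is which.
            fields = line[len("#Fields:"):].split()
        elif line and not line.startswith("#") and fields:
            values = line.split()
            if len(values) == len(fields):  # skip malformed lines
                yield dict(zip(fields, values))

def find_404s(lines):
    """Return the entries whose sc-status field is exactly 404."""
    return [e for e in parse_w3c_lines(lines) if e.get("sc-status") == "404"]

# Made-up, shortened sample log.
SAMPLE_LOG = [
    "#Software: Microsoft Internet Information Services 10.0",
    "#Fields: date time cs-method cs-uri-stem c-ip sc-status",
    "2022-09-16 10:15:02 GET /index.html 203.0.113.7 200",
    "2022-09-16 10:15:09 GET /admin/config.php 203.0.113.7 404",
]

for entry in find_404s(SAMPLE_LOG):
    print(entry["date"], entry["time"], entry["c-ip"], entry["cs-uri-stem"])
```

To run it against a real file, pass an open file object instead of the sample list, for example `find_404s(open("u_ex220916.log", encoding="utf-8", errors="replace"))`.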

Log Parser 2.2

Log Parser 2.2 is what I use for parsing larger amounts of IIS log files. Log Parser lets you use SQL-like commands to query the data, and the results can be output to CSV files.

I’ll be coming back to add some log parser query examples as soon as I can get them from my work notes.

C:\temp\logs\logparser "select * from u_ex180131.log" -o:datagrid

After running this command, Log Parser will open a grid window with your log data. You can copy it out to Excel, where you can do your analysis.
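To give a feel for what SQL-like analysis of log data looks like, here is a rough Python sketch that loads a few rows into an in-memory SQLite table and asks which missing URLs were requested most often. This is not Log Parser and not its syntax. The table name, column names and sample rows are all made up for illustration.

```python
import sqlite3

def top_missing_urls(rows, limit=10):
    """Return (uri, count) pairs for the most requested URLs that returned 404.

    `rows` is an iterable of (client_ip, uri, status_code) tuples.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE hits (c_ip TEXT, uri TEXT, status INTEGER)")
    conn.executemany("INSERT INTO hits VALUES (?, ?, ?)", rows)
    query = """
        SELECT uri, COUNT(*) AS misses
        FROM hits
        WHERE status = 404
        GROUP BY uri
        ORDER BY misses DESC, uri ASC
        LIMIT ?
    """
    result = conn.execute(query, (limit,)).fetchall()
    conn.close()
    return result

# Made-up sample rows using documentation-only IP ranges.
SAMPLE_ROWS = [
    ("203.0.113.7", "/admin/config.php", 404),
    ("198.51.100.20", "/admin/config.php", 404),
    ("203.0.113.7", "/wp-login.php", 404),
    ("203.0.113.7", "/index.html", 200),
]

for uri, misses in top_missing_urls(SAMPLE_ROWS):
    print(misses, uri)
```

A URL that nobody links to but that shows up repeatedly as a 404 is a good hint about what scanners are looking for.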

Also, I came across this page for a freeware OLEDB extension that says it lets you query any OLEDB data source, which Log Parser doesn’t support natively.

Video: How to Use Log Parser 2.2

Countersurveillance and SiteSpy

Now that I’ve covered how to find recon attempts in a log file using Notepad++ and Log Parser, I’ll share my personal “Ace in the hole”, SiteSpy.

SiteSpy is an application monitor I originally developed back in 2002 by accident when I was teaching web programming at Modesto Institute of Technology (MIT).

SiteSpy takes advantage of some existing Microsoft technologies, by running a monitor in the same memory space as the web application.

SiteSpy sniffs out the session connections in real-time and displays them in a webpage that is refreshed frequently. Items of interest bubble up to the top for review. Bot traffic can be filtered to increase recon sensitivity. It works well but is not 100% effective.

SiteSpy Recon Detection Example

This is a probing event I caught. The visitor was using the IP address directly, bypassing DNS, while probing for a non-existent file called “/admin/config.php”, all the way from Ramallah, Palestine.

SiteSpy showed me a hit on the 404 page after they got the initial 404 error. Otherwise, I would not have seen it until much later. I was able to update the firewall within minutes, denying them time and space to do more recon or an actual attack from that IP range for now.

In Conclusion

You don’t need a lot of fancy cybersecurity tools to do a little blue teaming. Just Notepad++ and the desire to learn.

The most important thing when reviewing a web application’s log files is to first know the application and all possible URL patterns.

Once you have established normal patterns, you can more easily find things that seem out of place.
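One way to put “know your URL patterns” into practice is to write the normal patterns down as regular expressions and list every requested path that matches none of them. The patterns below are hypothetical examples for an imaginary site. You would replace them with the routes your own application actually serves.

```python
import re

# Hypothetical "normal" URL patterns for an imaginary application.
KNOWN_PATTERNS = [
    re.compile(r"/"),
    re.compile(r"/products/\d+"),
    re.compile(r"/static/[\w./-]+"),
]

def is_expected(path):
    """True if the whole path matches one of the known patterns."""
    return any(p.fullmatch(path) for p in KNOWN_PATTERNS)

def out_of_place(paths):
    """Return the requested paths that match none of the known patterns."""
    return [p for p in paths if not is_expected(p)]

requested = ["/", "/products/42", "/static/site.css", "/admin/config.php"]
print(out_of_place(requested))
```

Anything this returns is not necessarily hostile, but it is exactly the “out of place” traffic worth a second look.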

Hope this helps someone!

Cyber Abyss

Create a Robots.txt Honeypot Cyber Counterintelligence Tool

Video Overview of this Blog Post

What is the Robots.txt File?

The Robots.txt file is just a simple text file meant to be consumed by search engines and web crawlers. It contains structured text that explains the rules for crawling your website.

In theory, search engines are supposed to honor the Robots.txt rules and not scan any URLs listed in the Robots.txt file if told not to.

Robots.txt was supposed to help avoid overloading websites with requests. According to Google, it is not a mechanism for keeping a webpage out of Google Search results.

If you really want to keep a web page out of Google you should try adding a noindex tag reference or password-protect the page.

With a Robots.txt file, you can create rules for user agents specifying which directories they can access, or disallow them all. It all sounds OK in principle, but on the internet, nobody really plays by the rules.

In fact, the Robots.txt file is one of the first places a bad guy might look for information on how your website is structured.

Too many websites make the mistake of using the Robots.txt file without giving thought to the fact that they might be rewarding OSINT or hacking reconnaissance efforts at the same time.
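To see how little effort this takes from the attacker’s side, here is a short Python sketch that pulls every Disallow path out of the text of a Robots.txt file. The function name and the sample file content are made up for illustration.

```python
def disallowed_paths(robots_txt):
    """Return every non-empty Disallow path found in the Robots.txt text."""
    paths = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path:  # an empty Disallow means "allow everything"
                paths.append(path)
    return paths

# Made-up sample content.
SAMPLE_ROBOTS = """\
# keep crawlers out of the private areas
User-agent: *
Disallow: /account/
Disallow: /backup/   # old site copies
Disallow:
"""

print(disallowed_paths(SAMPLE_ROBOTS))
```

Every path in that list is something the site owner has flagged as not for public indexing, which is why it makes such a handy starting point for recon.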

A Look at Amazon’s Robots.txt File

If we want to see what the Robots.txt file of a big website like Amazon looks like, all we have to do is load up this URL.

What files or directories does Amazon tell the Google search engine not to crawl or index?

It looks like account access login and email-a-friend features are off limits, so these are the first places a hacker will be looking.

More Sample Robots.txt files from Google

# Example 1: Block only Googlebot
User-agent: Googlebot
Disallow: /

# Example 2: Block Googlebot and Adsbot
User-agent: Googlebot
User-agent: AdsBot-Google
Disallow: /

# Example 3: Block all but AdsBot crawlers
User-agent: *
Disallow: /

You can find more detailed information on how to make more complex robots.txt files over on the Google Search Central area for developers.

What is Counterintelligence?

Counterintelligence typically describes an activity aimed at protecting an agency’s intelligence program from an opposition’s intelligence service.

More specifically, it covers information collection activities related to preventing espionage, sabotage, assassinations or other intelligence activities conducted by, for, or on behalf of foreign powers, organizations or persons.

In terms of this blog post, we’ll use a honeypot-based approach to see who is looking at the Robots.txt file and scanning folders we’ve asked them not to, and record information about each HTTP call for later review and analysis.

A Real World Robots.txt Based HoneyPot Example

Using the Robots.txt file as part of a honeypot system, we will broadcast a list of folders we supposedly don’t want search engines to index. In this case, though, every folder on the list points to a honeypot page.

Having a honeypot / data collection service running in these folders allows you to see who is using the Robots.txt file to scan your web server thus tipping you off that OSINT footprinting activity on your webserver or domain names may be taking place.

These folders have a Disallow rule but contain honeypot code to collect information about the HTTP calls made against them and in some cases to redirect the user-agent somewhere else.

A Sample Honeypot Robots.txt File

A Sample Robots.txt file where we are telling all user-agents to stay away from our admin, wordpress and api folders.

# All other directories on the site are allowed by default
User-agent: *
Disallow: /admin/
Disallow: /wordpress/
Disallow: /api/
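As a quick check of how a well-behaved crawler would read these rules, Python’s standard library ships a Robots.txt parser. The sketch below feeds it the sample rules and asks whether a couple of paths may be fetched. The bot name “ExampleBot” and the test paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /wordpress/
Disallow: /api/
"""

def build_parser(text):
    """Parse Robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)

# A polite crawler should stay out of the honeypot folders...
print(parser.can_fetch("ExampleBot", "/admin/config.php"))
# ...and is free to crawl everything else.
print(parser.can_fetch("ExampleBot", "/blog/first-post"))
```

Any client that requests a path where `can_fetch` returns False has either ignored the rules or read them as a map, and both cases are worth logging.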

The Robots.txt file I’m discussing in this post was collected from the online classifieds website,

If you try to hit any of the URLs found in the Robots.txt file, you’ll be redirected to a Rick Roll video. IP address data is collected in a log for a more detailed review.

Each one of these honeypot URLs does a little something different.

The first honeypot URL replies with something naughty, the second logs it as a scan and the third URL is the Rick Roll redirect.

Depending on the honeypot page, you can collect data from the user-agent and log it before you redirect them off to the Land of Oz.
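For readers who want to experiment, here is a rough sketch of the log-then-redirect idea using only Python’s standard library. This is not the code running on my site. The folder list mirrors the sample Robots.txt above, while the log file name, the redirect target, the port and the function names are all placeholders I made up for illustration.

```python
import json
from datetime import datetime, timezone
from http.server import BaseHTTPRequestHandler, HTTPServer

HONEYPOT_PREFIXES = ("/admin/", "/wordpress/", "/api/")
REDIRECT_TARGET = "https://example.com/"  # placeholder destination
LOG_FILE = "honeypot.log"                 # hypothetical log file name

def is_honeypot_path(path):
    """True if the requested path falls inside a honeypot folder."""
    return path.startswith(HONEYPOT_PREFIXES)

def build_record(client_ip, path, user_agent):
    """Collect the details of one HTTP call for later review."""
    return {
        "time": datetime.now(timezone.utc).isoformat(),
        "ip": client_ip,
        "path": path,
        "user_agent": user_agent,
    }

class HoneypotHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if is_honeypot_path(self.path):
            record = build_record(
                self.client_address[0],
                self.path,
                self.headers.get("User-Agent", ""),
            )
            # Log first, then send the visitor somewhere else.
            with open(LOG_FILE, "a", encoding="utf-8") as fh:
                fh.write(json.dumps(record) + "\n")
            self.send_response(302)
            self.send_header("Location", REDIRECT_TARGET)
            self.end_headers()
        else:
            self.send_response(404)
            self.end_headers()

def run(host="127.0.0.1", port=8080):
    """Start the sketch server; call this yourself when you want to try it."""
    HTTPServer((host, port), HoneypotHandler).serve_forever()
```

Calling `run()` and then requesting `http://127.0.0.1:8080/admin/` should append one JSON line to the log file and answer with a redirect. In production you would put the same logic in whatever platform your site already runs on.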

A tool that was originally created to be helpful, and ended up becoming dangerous, can now be a double agent if you set it up correctly.

Hoping this helps someone on their InfoSec journey.

~Cyber Abyss