TL;DR
I created a Twitter bot which monitors multiple paste sites for different types of content (account/database dumps, network device configuration files, etc.). You can find it on Twitter and on Github.
Introduction
Paste sites such as Pastebin, Pastie, Slexy, and many others offer users (often anonymously) the ability to upload raw text of their choice. This is helpful in many scenarios, such as sending a crash report to someone or pasting temporary code. However, in addition to some people not being careful with what they upload (leaving passwords and other sensitive data in the text), attackers have started using these sites to share post-compromise data, including user account data, database dumps, URLs of compromised sites, and more.
Since there are so many users uploading text to these sites, it's often difficult to find the interesting pastes manually. While techniques such as Google Alerts can be applied, the results are often a day or two old and the pastes are sometimes already deleted. This prompted me to create a tool which monitors these sites in "real-time" (less than a minute of delay for the slowest sites) for specific expressions, and then automatically ranks, aggregates, and posts the results to Twitter for further analysis. I call this tool DumpMon.
Similar Tools
There are a couple of similar tools available which do essentially the same thing as dumpmon - with just a few key differences:
- @PastebinLeaks - With its last tweet on December 16, 2011, PastebinLeaks no longer appears to provide Pastebin monitoring. However, I really like how it integrated quite a few different expressions, such as ones for HTTP passwords, Cisco and Juniper configuration files, etc. Unfortunately, as far as I can tell, PastebinLeaks is closed-source.
- @PastebinDorks - This bot (intentionally closed-source, still in "alpha") is still active and posts a few tweets per day. It appears to be primarily concerned with account credential dumps. I think assigning a numerical rank to each tweet could help convey how useful a paste is, but on its own the rank leaves it unclear what data was actually found.
My goal with dumpmon is to create the "next step" of paste site monitoring with the following key features:
- Open-source. I'm always open to contributions via Github. I'm working on the documentation - it should be up soon.
- Monitors more than just Pastebin (full site listing in the Appendix)
- Supports multiple paste types (e.g. Cisco configuration files and honeypot logs) - a quick detection sketch follows this list
- For large account dumps, simply gives you the raw information (Emails: x, Hashes: y) directly in the tweet
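To give a feel for how a paste type can be recognized, here's a minimal sketch using a few illustrative signature expressions. These patterns (and the `classify` helper) are hypothetical stand-ins, not dumpmon's actual regular expressions - those are on Github:

```python
import re

# Illustrative signature patterns only -- dumpmon's real expressions are more involved.
SIGNATURES = {
    "db_dump": re.compile(r"(?i)insert into|available dbs|database:\s"),
    "cisco_config": re.compile(r"(?i)enable (secret|password) |interface \S+ethernet"),
    "honeypot_log": re.compile(r"(?i)kippo|login attempt \["),
    "google_api_key": re.compile(r"\bAIza[0-9A-Za-z_\-]{35}\b"),
}

def classify(paste_text):
    """Return the paste types whose signature appears in the raw text."""
    return [name for name, pattern in SIGNATURES.items() if pattern.search(paste_text)]
```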
In the future, I would like to look into implementing the following features:
- Automatically run found hashes through large wordlists and post the results (see the rough sketch after this list)
- Allow users to tweet a regular expression they want monitored to the bot. The bot will then tweet them the paste once it finds a match
- Search for interesting details from other sources of information (such as popular forums, etc.) instead of just paste sites
- Allow caching of "most interesting" results to prevent deletion
- Create daily/monthly reports showing the amount of detected data, to aid in password research
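As a rough idea of how the wordlist feature could work (again, this isn't implemented yet), unsalted MD5 hashes pulled from a paste could simply be compared against hashed wordlist candidates. The `crack_md5` helper and its arguments are purely illustrative:

```python
import hashlib

def crack_md5(found_hashes, wordlist_path):
    """Try each wordlist candidate against a set of (assumed unsalted) MD5 hashes."""
    remaining = set(h.lower() for h in found_hashes)
    cracked = {}
    with open(wordlist_path) as wordlist:
        for candidate in wordlist:
            candidate = candidate.rstrip("\r\n")
            digest = hashlib.md5(candidate.encode("utf-8")).hexdigest()
            if digest in remaining:
                cracked[digest] = candidate
                remaining.discard(digest)
            if not remaining:
                break
    return cracked
```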
With those features outlined, let me quickly show you how I built the bot. Don't care? Just go straight to the bot here.
Bot Architecture
Here is the general architecture of the bot that's currently running: each paste site is handled by its own thread, which monitors for new pastes, downloads each one, and matches it against a series of regular expressions. Then, if it finds a match, it builds and posts a tweet summarizing the paste.
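To make that concrete, here's a simplified sketch of what the per-paste analysis might look like. This is a stand-in rather than the actual dumpmon code - the real expressions and tweet format live on Github:

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
MD5_RE = re.compile(r"\b[0-9a-fA-F]{32}\b")   # unsalted MD5-style hashes

def analyze(paste_url, paste_text):
    """Run the expressions over one downloaded paste and, if it looks
    interesting, return the text of the tweet to post."""
    emails = set(EMAIL_RE.findall(paste_text))
    hashes = set(MD5_RE.findall(paste_text))
    if not emails and not hashes:
        return None
    return "Possible dump: {0} Emails: {1} Hashes: {2}".format(
        paste_url, len(emails), len(hashes))
```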
If hashes are found, the tweet will also include the number of hashes as well as the ratio of emails to hashes. The "Keywords" attribute gives an approximate ratio of "positive keywords" found out of a given list (such as "Target: ", "available dbs", "member_id", "hacked by", "database: ", etc.), subtracting value for each expression matched from a blacklist. It's just another metric to help determine whether a paste is "interesting." It should also be noted that the emails found are unique.
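The score itself can be computed roughly like this - the keyword lists here are shortened, illustrative versions of the real ones:

```python
import re

# Shortened, illustrative keyword lists -- the real ones are longer.
POSITIVE = ["Target: ", "available dbs", "member_id", "hacked by", "database: "]
BLACKLIST = [r"<\?php", r"#include\s*<"]   # e.g. signs of ordinary source code

def keyword_score(paste_text):
    """Approximate "Keywords" metric: the fraction of positive keywords present,
    reduced for every blacklist expression that also matches."""
    lowered = paste_text.lower()
    hits = sum(1 for kw in POSITIVE if kw.lower() in lowered)
    penalties = sum(1 for pattern in BLACKLIST if re.search(pattern, paste_text, re.I))
    return float(hits - penalties) / len(POSITIVE)
```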
Don't Bite the Hand that Feeds
It's commonly known that the most time-expensive part of web scraping is actually fetching the content. While I could speed up this process by using an event-driven framework such as Gevent, Twisted, or others, I wanted to do my best to respect the sites hosting the content. Also, I didn't want the tool to get temporarily blocked... for a third time (my bad, Pastebin). With this in mind, the bot only fetches pastes it hasn't seen before and spaces its requests out with polite time constraints.
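The actual scheduling code is on Github, but the general idea for each site's monitoring thread looks something like the loop below. The `site` helpers (`list_recent`, `raw_url`, and the delay attributes) are hypothetical placeholders for each site's scraping logic:

```python
import time

import requests

def monitor(site, process_paste):
    """Poll a site's "recent pastes" listing, download only the pastes we
    haven't seen yet, and back off when nothing new shows up."""
    seen = set()
    delay = site.min_delay                 # e.g. 30 seconds between listing checks
    while True:
        new_ids = [pid for pid in site.list_recent() if pid not in seen]
        for pid in new_ids:
            raw = requests.get(site.raw_url(pid)).text
            process_paste(pid, raw)
            seen.add(pid)
            time.sleep(site.fetch_delay)   # pause between individual downloads
        # Nothing new? Wait longer before asking again (up to a maximum).
        delay = site.min_delay if new_ids else min(delay * 2, site.max_delay)
        time.sleep(delay)
```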
Appendix
Currently, dumpmon monitors Pastebin, Pastie, and Slexy, and supports the following paste types:
- Account/Database dumps
- Google API Keys
- Cisco Configuration Files (Juniper to be added soon)
- Honeypot Log Dumps
If you can think of any other paste sites or paste types you want added, let me know!
Follow @dumpmon
- Jordan