Below is an article I wrote for IndexuSupport.com. So far it’s article 4 in the series. I posted all 4 tonight and you can read them on my site for free.
Users often ask me how they can get a database for their website. There are several answers to this question.
1) You can buy databases from multiple sources. Some of them are good, a lot of them are bad. Remember to ask questions and get a sample before you buy.
2) You can use a DMOZ slice. This is a really good way to go. In fact, it’s how a lot of my larger sites started. If you don’t want to buy the DMOZ Extractor you can buy slices for as low as $5.00 from me.
3) You can scrape it!
First, let’s get this out of the way. Scraping a database off the net is probably as close as you can come to stealing. Scraping involves processing one or more sites through a few programs that will save and extract the data that you want.
There are millions of web sites on the internet, all with a lot of links, content and data. Where you go to look for your data will depend on what kind of database you want.
For this example let’s just assume you have a CD-ROM with the Yellow Pages on it in HTML format. And for simplicity let’s assume that there are 27 pages (a-z plus numbered companies). Yes, they would be large pages, but that’s not the point.
The first thing you would do is copy those pages to your hard drive so they can be processed faster. Reading from a CD-ROM is slower than reading from your hard drive. Similarly, if you were trying to scrape webpages you would want to copy them to your hard drive as well.
For copying a website to your hard drive you would probably use a program like Offline Explorer. There are three versions: Offline Explorer, Offline Explorer Pro and Offline Explorer Enterprise. At a minimum you want the Pro version. It costs around $90 and is essential in your data mining/scraping endeavour.
So you have the web pages on your hard drive in a specific directory. Now you want to suck the data out of them. For our purpose we only want the URLs, but you can mine any data you want.
For the actual data mining you need a program called TextPipe. TextPipe comes in three versions: TextPipe Lite, TextPipe Standard and TextPipe Pro. At a minimum you need TextPipe Standard, and it costs $199.
I won’t get into how to work TextPipe here, but will say that once you start up TextPipe you simply add some filters, point it to a directory, file or set of files and click go. From there TextPipe does all the work and mines the data from the pages. It will output the data where you tell it to.
TextPipe Standard can strip HTML tags for you, which saves you a TON of time messing around with the saved data. What you are left with is simply a list of URLs.
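I’m not going to show TextPipe filters, but if you want a feel for what this step does, here is a rough sketch of the same idea in Python. This is not how TextPipe works internally, just an illustration. The names LinkCollector and extract_urls are made up for this example, and I’m assuming the saved pages end in .html and that you only care about absolute http/https links.

```python
from html.parser import HTMLParser
from pathlib import Path


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            # Assumption: keep only absolute http/https links, skip relative ones.
            if name == "href" and value and value.startswith(("http://", "https://")):
                self.urls.append(value)


def extract_urls(directory):
    """Return a sorted list of unique URLs found in *.html files under directory."""
    found = set()
    for page in Path(directory).rglob("*.html"):
        collector = LinkCollector()
        # errors="replace" keeps odd encodings from stopping the run.
        collector.feed(page.read_text(encoding="utf-8", errors="replace"))
        found.update(collector.urls)
    return sorted(found)
```

Point extract_urls at the directory where you saved the pages and write the result out one URL per line. The set takes care of duplicates, which you will get a lot of when the same site is linked from many pages.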
With a list of URLs you can then import them into IndexU, and use the Fetch Meta function of IndexU to download the title, description and keywords of all the URLs in the database.
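To give you an idea of what a meta fetch involves, here is a small Python sketch that pulls the title, description and keywords out of a page. To be clear, this is not IndexU’s code and I’m not claiming Fetch Meta works this way. MetaParser, parse_meta and fetch_meta are names I made up, and I’m assuming the pages can be decoded as UTF-8.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class MetaParser(HTMLParser):
    """Picks up <title> text and the description/keywords meta tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.meta = {"title": "", "description": "", "keywords": ""}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Meta names are case-insensitive in practice, so normalise them.
            name = (attrs.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.meta[name] = (attrs.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.meta["title"] += data


def parse_meta(html):
    """Return a dict with the title, description and keywords found in html."""
    parser = MetaParser()
    parser.feed(html)
    parser.meta["title"] = parser.meta["title"].strip()
    return parser.meta


def fetch_meta(url, timeout=10):
    """Download one page and parse its meta data (assumes UTF-8 content)."""
    with urlopen(url, timeout=timeout) as response:
        html = response.read().decode("utf-8", errors="replace")
    return parse_meta(html)
```

Calling fetch_meta on each URL in your list gives you a dict per site. Pages without a description or keywords tag simply come back with empty strings for those fields.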
But don’t get me wrong, TextPipe can do a lot more than just URLs. It can save almost any formatted data off of a webpage including addresses, phone numbers, descriptions and much much more. I just used URLs as an example.
Sure, I made it all sound simple, but once you’re familiar with how these programs work and how the filters work with TextPipe, there is really no end to the kinds of databases that you can create.
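As an example of mining something other than links, here is a short Python sketch that pulls phone numbers out of text that has already had its HTML stripped. Again, this is only an illustration and not a TextPipe filter. The function name extract_phones is made up, and the pattern assumes North American style numbers (three digits, three digits, four digits), so it will miss other formats.

```python
import re

# Assumption: North American style numbers such as (555) 123-4567 or 555.123.4567.
PHONE_PATTERN = re.compile(r"\(?\b\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b")


def extract_phones(text):
    """Return every phone-number-looking string in text, in the order found."""
    return PHONE_PATTERN.findall(text)


if __name__ == "__main__":
    sample = "Call (555) 123-4567 or 555.987.6543 today"
    for number in extract_phones(sample):
        print(number)
```

A pattern like this is the same kind of thing you end up writing as a filter. Addresses and descriptions are harder because they don’t follow one fixed shape, which is where most of the learning curve comes from.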
In all seriousness, if you have a single website running IndexU then this method is not for you. You’re looking at a cost of around $300 US just for the software, and then add your time and the learning curve required and it’s not worth the effort. However you could pay someone to do the scraping/data mining for you. It may sound expensive, but it sure saves learning how to write filters!
——————————————————————–
If you are interested in any data mining please let me know and I can try and get you a quote. I would need the following information:
1) The approximate number of pages and URLs to mine
2) The topic you are after for your database (I may have URL suggestions)
3) The exact data you want from each page
4) What format you want the data in (completed database, raw)
5) Your expected time frame
Please understand that data mining is a service that I am offering personally, not Nicecoder.









