Extract data from the Internet with Web scraping

By Rob Reilly on March 29, 2006 (9:00:00 AM)


Even if you don't know how to access databases using a Web browser or use an RSS reader, you can extract information from the Internet through Web page scraping. Here's how you can use some Linux-based tools to get data.

First, you need to decide what data you want and what search strings you'll use to get it. Obviously, if you need only three lines of information, cutting and pasting is the best way to go. However, if a Web page has 640 lines of useful data that you need to download once a day for the next week, automating the process makes a lot of sense.

Start small. Live Web pages can be complicated to scrape, and designing search strings can be challenging. As your experience grows, you'll see faster and more efficient ways to use the techniques.

For this example, I wanted to see how my articles were doing in terms of page views. I created a demo Web page using a subset of my actual home page and added fictitious page-view numbers to the original data to give myself some numerical information to work with.

Take a look at this portion of the HTML code for the example page, named index-demo.htm. I've kept the example HTML code simple, so you can see how the commands and techniques work. Assume that the page resides on a Web server. Also, note that I haven't coded any wrapping into the title and page-view line.

<html>
<body>

<b><font size="+1">Rob's Demo Web Page</font></b>
<br>
<br>

<!----------- Articles ------------------>
&nbsp;<a href="http://software.newsforge.com/article.pl?sid=05/08/25/167213"> Using WireFusion for 3D animation</a>&nbsp;&nbsp;&nbsp;(09/08/2005 on NewsForge.com)&nbsp;&nbsp;&nbsp;(103,000 page views)<br>
&nbsp;&nbsp;&nbsp;<em>An interesting GUI program for 3D animation</em>
   <br>
   <br>

&nbsp;<a href="http://software.newsforge.com/software/05/07/08/1913220.shtml"> Spice up your presentations with OpenOffice.org Impress</a>&nbsp;&nbsp;&nbsp;(07/13/2005 on NewsForge.com)&nbsp;&nbsp;&nbsp;(652,000 page views)<br>
&nbsp;&nbsp;&nbsp;<em>Advanced OOo Impress techniques</em>
   <br>
   <br>
.
.
.

&nbsp;<a href="http://management.itmanagersjournal.com/article.pl?sid=04/12/28/2159255"> Personal branding for IT professionals - Part 2</a>&nbsp;&nbsp;&nbsp;(12/29/2004 on IT Managers Journal.com)&nbsp;&nbsp;&nbsp;(489,000 page views)<br>
&nbsp;&nbsp;&nbsp;<em>Part 2 - How to use Open Source tools to help market yourself</em>
   <br>
   <br>

</body>
</html>

Prepare the data for import

You'll use several command-line programs to manipulate the data into a form that you can import right into an application. In this case, I'm using OpenOffice.org Calc, but you can use others.

First, download the page using GNU Wget, a command-line Web-page grabber that features an impressive selection of options. It can download a single page or a whole site recursively.
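If you ever need more than a single page, a recursive fetch might look something like this. The options shown are one reasonable combination, not the ones used in this article: -r turns on recursion, -l 1 limits the depth to one level, and -np keeps Wget from wandering up to parent directories.

> wget -q -r -l 1 -np http://home.earthlink.net/~robreilly/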

For this example, I want to download the index-demo.htm page and save it in the local directory. Here, the -q option suppresses Wget's normal progress output:

> wget -q http://home.earthlink.net/~robreilly/index-demo.htm

Next, pull out the desired lines using grep, a text-search tool that prints the lines matching a given string. Looking at the index-demo.htm file, you can see that the closing </a> tag followed by three non-breaking spaces (&nbsp;) appears only on the lines containing the title and page-view numbers. Redirect the output to tmp1.txt for use in the next step:

> grep '</a>&nbsp;&nbsp;&nbsp' index-demo.htm > tmp1.txt
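At this point, tmp1.txt holds only the article lines, still carrying all of their HTML markup. Tracing the first (WireFusion) line through the rest of the pipeline shows how each step whittles it down. As it sits in tmp1.txt, the line reads:

&nbsp;<a href="http://software.newsforge.com/article.pl?sid=05/08/25/167213"> Using WireFusion for 3D animation</a>&nbsp;&nbsp;&nbsp;(09/08/2005 on NewsForge.com)&nbsp;&nbsp;&nbsp;(103,000 page views)<br>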

Next, grab the required fields using gawk, a pattern-matching tool that can also format its text output. Looking at tmp1.txt, you can see that each line breaks neatly into pieces on the "(" character. You want the article title (field 1) and the page-view number (field 3); direct the output to tmp2.txt:

> gawk -F'(' '{print $1 " " $3}' tmp1.txt > tmp2.txt
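Split on "(" and reassembled from fields 1 and 3, the WireFusion line comes out of gawk looking roughly like this in tmp2.txt:

&nbsp;<a href="http://software.newsforge.com/article.pl?sid=05/08/25/167213"> Using WireFusion for 3D animation</a>&nbsp;&nbsp;&nbsp; 103,000 page views)<br>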

Next, clean up the output so it imports into OpenOffice.org Calc smoothly. Use sed, a text-stream editor that reads a file and makes string substitutions on the fly. Two "*" characters in a row mean the matched string is replaced with nothing, which simply deletes it:

> sed 's*</a>&nbsp;&nbsp;&nbsp**; s*page views)<br>**' tmp2.txt > output.csv

The "*" character marks the strings. You'll quickly find real Web pages where using the standard sed "/" string marker makes the substitution confusing. It's OK to use another character, as long as it isn't a special character (check the sed manual) and doesn't appear in your desired text string.

I also put two substitutions together in one sed command using the ";" character. The first one gets rid of the "</a>" tag and the non-breaking spaces, and the second gets rid of the "page views)<br>" string. Notice that the search string deliberately omits the final ";" of the third "&nbsp;", so one ";" survives in the output. This conveniently delimits the two fields when importing into Calc.
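After the sed pass, the WireFusion record in output.csv comes out roughly as:

&nbsp;<a href="http://software.newsforge.com/article.pl?sid=05/08/25/167213"> Using WireFusion for 3D animation; 103,000

The leftover ";" separates the two fields, but the leading &nbsp; and <a href=...> tag still ride along in the first field. If you'd rather strip them before importing, one more pass would do it. This extra step, and the output-clean.csv name, are additions to the original recipe; it uses "|" as the delimiter because the pattern itself needs a "*" quantifier:

> sed 's|&nbsp;<a href="[^"]*">||' output.csv > output-clean.csv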

Import the list into OOo Calc

[Figure: the page-view data charted in OOo Calc]
Finally, we can import the output.csv file into the OOo Calc application. OpenOffice.org versions 2.0 and above use separate commands for the various programs. Use oocalc to start Calc from the command line:

> oocalc output.csv

Calc recognizes the .csv file and starts the import dialog box automatically. Check the semicolon separator box and press OK to bring the data into a blank spreadsheet.

Create a chart by highlighting the data and clicking the Chart button. Choose the bar chart, using the first column as the label. Click the Create button, and the chart with the page-view data appears (see figure).

Once you get the hang of using these tools, you can streamline the process by putting the steps into a script:

wget -q http://home.earthlink.net/~robreilly/index-demo.htm
grep '</a>&nbsp;&nbsp;&nbsp' index-demo.htm > tmp1.txt
gawk -F'(' '{print $1 " " $3}' tmp1.txt > tmp2.txt
sed 's*</a>&nbsp;&nbsp;&nbsp**; s*page views)<br>**' tmp2.txt > output.csv
rm tmp1.txt tmp2.txt
oocalc output.csv

Save the commands in a file (script.txt is the name used below) and change its permission to 744 so it's executable:
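> chmod 744 script.txt

Then run it: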

> ./script.txt

Once it executes, all you need to do is finish creating the chart in Calc.

Many Web pages have a text-search box. Use it to your advantage by putting in your search text and using the generated URL with the Wget command. This technique is great for spanning multiple pages and even multiple sites.
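For example, if a site's search form produces a URL with the query string in it, you can feed that URL straight to Wget. A hypothetical sketch (the site and query here are made up for illustration):

> wget -q 'http://www.example.com/search?q=page+views' -O results.htm

The -O option just names the saved file, and quoting the URL keeps the shell from misinterpreting characters such as "&" that search URLs often contain.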

Real-life Web pages can be complicated. Understanding Wget, grep, gawk, and sed will help you make adjustments and retrieve the right data.

SQL, XML, and RSS are all the buzz today, but the lost art of screen (Web) scraping still has a useful place in today's business environment. That's especially true as the Linux desktop continues to march into the office world.

Rob Reilly is a consultant, trend spotter, and writer. His company advises clients in a variety of business and technology areas.


Comments

data mining - additional options

Posted by: Anonymous Coward on March 30, 2006 05:30 AM
I do a lot of data mining from websites for a variety of reasons. I tend to use a combination of Perl and wget mostly. I call wget from within Perl scripts and use Perl's powerful text manipulation methods to locate the data I'm interested in. I find that massaging the text within a Perl script, rather than saving to multiple files or piping constantly, is more efficient and easier to understand.

I'm not saying there's anything wrong with the above method, just offering another solution.


www::mechanize and lwp

Posted by: Anonymous Coward on March 30, 2006 06:20 AM
If you are going to do any amount of screen scraping on the Web, you really should use the Perl module WWW::Mechanize, which is built on top of LWP, and then use the various HTML parsers.

It is amazing what Mechanize can do just by itself.


BeautifulSoup

Posted by: Anonymous Coward on March 30, 2006 05:15 PM
And if Python is your poison, try BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/).


Using Python

Posted by: Anonymous Coward on March 31, 2006 12:47 AM
Here's how I'd do it, using Python:

#-------start of code --------------
import urllib, re, csv

URL = 'http://home.earthlink.net/~robreilly/index-demo.htm'
output = csv.writer(file('c:/temp/output.csv', 'wb'))

content = urllib.urlopen(URL).read()
data = re.findall('(.*).*\((.*) page views\)', content)
output.writerows(data)
#------- end of code --------------

This just creates the CSV file -- it doesn't open OpenOffice for you. Another difference: in my version, I let the regexp do the field separation.



Data Terrorism.

Posted by: Anonymous Coward on March 31, 2006 01:02 AM
I'm wondering if all this "data mining" in this day and age could be misinterpreted as an attack on the sites being mined?

