Customizing ReaderwareVW Extraction

First off, this is not for everyone. Customizing the ReaderwareVW extraction process means getting your hands dirty and writing some script code to massage the extracted data. But for those of you who have experience writing scripts, this is an extremely powerful feature.

So why would you want to customize the extraction? There are a number of things you can do. Some users wanted the title in uppercase. Another user didn't like ReaderwareVW extracting categories; she wanted to use her own. You can even substitute your own category for the extracted one. Someone else wanted all new videos to be marked as viewed. All this is possible with a simple script.

ReaderwareVW calls a Python script named vwuserexit.py after it extracts data from a web site and before it adds a video to the database. Using this script you can customize the data. You will find a basic copy of this script in your readerware\scrapers directory.

It is called vwuserexit_sample.py; you must rename it to vwuserexit.py before ReaderwareVW will call it.

Mac OS X users will find the scrapers folder inside the application package. Control-click the ReaderwareVW program icon and select Show Package Contents from the popup menu. Then open Contents, Resources, Java and scrapers in turn.

Here is what it looks like:

# Scraper user exit.
#
# Copyright © 1999-2009 Readerware Corporation. All Rights Reserved.
#
# To activate this sample vwuserexit script it must be renamed to
# vwuserexit.py. If vwuserexit.py exists it is called immediately
# before the scraper process returns. You can change any of the
# global variables to customize the extraction process.
# More info: http://www.readerware.com/help/vwCustomExtraction.html
#
#
# This is the basic vwuserexit.py script, it does nothing by itself.
# You can add your statements to customize the extracted data.
#

import string

def userextract():
    global title,actor1,actor2,actor3,actor4,actor5,actor6
    global actor7,actor8,actor9,actor10,director,writer
    global screenwriter,photographer,composer,editor,series
    global upc,isbn,lccn,dewey,userNumber,format,studio,place
    global date,copyDate,mpaa,wide,closedCap,sound,copies
    global rating,condition,category,viewed,pflag,eflag,value
    global valueDate,comments,dateEntered,dataSource,cart,ordered
    global copies,location,keywords,book,author,running,color
    global track1,track2,track3,track4,track5
    global track6,track7,track8,track9,track10
    global track11,track12,track13,track14,track15
    global track16,track17,track18,track19,track20
    global user1,user2,user3,user4,user5,user6,user7,user8,user9,user10
    global usedprice,usedcount,collectibleprice,collectiblecount
    global newprice,newcount,listprice,salesrank,available
    global buyerwaiting,editionNumber,image,fullDateFormat,source

    # Add your statements here


userextract()

By itself this script does nothing, but it is the starting point for developing your own scripts. Note the global statements. These identify the global variable names that ReaderwareVW uses; in other words, the variable "title" contains the extracted title, and so on. This is really all you need to know about how the process works: you set or change the contents of the variables to the required data. So for example, if you don't want ReaderwareVW to extract categories from a web site, you could add the following line to the userextract function, after the "Add your statements here" comment:
    category = ""
For something a little trickier, suppose you wanted to map the categories extracted from a web site to your own categories:
    if (string.find(category, "Mystery") != -1):
        category = "Movies : Mystery"
You would need to add a statement like this for every category and every web site you use. The basic idea is to check for a string in the extracted category and, if it is found, replace the category with another.
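
If you have many categories to map, a lookup table keeps the checks in one place. The following is only a sketch: the helper name map_category, the table CATEGORY_MAP and the category names in it are made up for illustration and are not part of ReaderwareVW. It uses the find string method, which performs the same substring test as string.find.

```python
# Hypothetical lookup table: text to look for -> your own category.
# The entries are examples only; replace them with your own.
CATEGORY_MAP = [
    ("Mystery", "Movies : Mystery"),
    ("Comedy", "Movies : Comedy"),
    ("Documentary", "Movies : Documentary"),
]


def map_category(extracted, table=CATEGORY_MAP):
    """Return your category for the first matching entry.

    If no entry matches, the extracted category is returned unchanged.
    """
    for text, replacement in table:
        if extracted.find(text) != -1:
            return replacement
    return extracted
```

Inside userextract you would then write category = map_category(category). Entries are checked in order, so put the more specific text first if two entries could match the same category.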

If you want to uppercase the title:
    title = string.upper(title)
This may all look very strange if you have not seen it before; the script is written in the Python language. If you know Python, you're all set. If you know another scripting language like Perl, it shouldn't be much of a challenge.
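
To see how the pieces fit together, here is a sketch of a customized script with the category mapping and the uppercase title in place. The two assignments at the top are stand-ins with made-up values so the sketch can be run on its own; they are an assumption for illustration, since ReaderwareVW sets these variables itself before calling the real script. The sketch uses string methods instead of the string module, which should behave the same way for these two operations.

```python
# Stand-in values so this sketch runs on its own. They are made up;
# in the real script ReaderwareVW supplies the extracted data.
title = "The Maltese Falcon"
category = "Mystery & Suspense"


def userextract():
    # Only the variables this sketch changes are listed here. In your
    # own script you can keep the sample's full list of global statements.
    global title, category

    # Map an extracted category to one of your own.
    if category.find("Mystery") != -1:
        category = "Movies : Mystery"

    # Store the title in uppercase.
    title = title.upper()


userextract()
```

Note that the added statements are indented so that they belong to the userextract function. Python uses indentation to decide which statements are part of a function.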

Optional Fields

One use for a custom extraction script is to move optional fields to the database. ReaderwareVW extracts a number of optional fields. As there are no database columns for these optional fields, you have to store them in a user defined column if you need this data.

The process is very simple. First define a user column for the data. Select the Preferences menu item, User Columns tab. Enter the column title and check the active box for each user defined column you want to add.

Next you have to tell ReaderwareVW to move the data to this user column. You do this in a custom extraction script. For example the following line will store the sales ranking in the first user defined column:

    user1 = salesrank
The optional fields are:

available          Item is available
buyerwaiting       Buyer waiting for item
collectiblecount   Number of collectible items available
collectibleprice   Lowest collectible price
listprice          List price
newcount           Number of new items available
newprice           Lowest new price
salesrank          Sales rank
usedcount          Number of used items available
usedprice          Lowest used price
editionNumber      Edition number

If you want to extract any of this data, just move the data to a user defined column in your custom extraction script. Use the field name shown above in your script.
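
As a sketch, moving several optional fields at once might look like the following. The assignments at the top are stand-ins with made-up values so the example runs on its own; ReaderwareVW supplies the real values. Which user column receives which field is your choice, and the assumption that a field with no data arrives empty is mine, not something the documentation states.

```python
# Stand-in values so this sketch runs on its own; they are made up.
salesrank = "1234"
listprice = "19.99"
usedprice = ""
user1 = user2 = user3 = ""


def userextract():
    global salesrank, listprice, usedprice, user1, user2, user3

    user1 = salesrank   # sales rank into the first user column
    user2 = listprice   # list price into the second user column

    # Assumption: a field with no data is empty. Only fill the third
    # user column when a used price was actually extracted.
    if usedprice:
        user3 = usedprice


userextract()
```

Each user column used this way must first be defined and made active on the User Columns tab.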

Learning Python

There are a lot of resources available on the web to help you with Python and a lot of books available too. Just fire up your browser and search for Python titles at your favorite book retailer. A good place to start your web search is at the official Python site www.python.org.

Note that you don't have to install Python; all necessary libraries are included with the Readerware distribution.

Python is a very powerful language and fairly easy to learn. If you're wondering about the name, yes, it was named after Monty Python. Unfortunately I cannot offer support on Python itself. You will need to discover the power of Python for yourself.

A book I really like is "Learning Python" by Mark Lutz. It has a very readable approach and covers both the basics and advanced topics. The "Python Pocket Reference", also by Mark Lutz, is a handy thing to keep by your keyboard. Another good one is "Text Processing in Python" by David Mertz. A friend recommends "Python Programming on Win32" by Mark Hammond, which covers Python with particular emphasis on using it with Windows.

Debugging Your Script

Even the best Python programmer is going to make a mistake once in a while. Fortunately it is very easy to debug your scripts with ReaderwareVW. First, start ReaderwareVW, go to General Preferences and ensure the User logging check box is checked. You must restart ReaderwareVW when you change this option.

Use ReaderwareVW as normal. When extracting data ReaderwareVW will output debugging information and any error messages to a log file, rwuser.log. You can view this file in any text editor.

Also, with user logging on, ReaderwareVW writes the HTML page it retrieved from the web site to the Readerware directory as trace.html. This can be useful when debugging scripts.
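
The newest messages are at the end of the log, so when the file grows it can be convenient to look at only the last few lines. The helper below is a sketch of one way to do that. It is not part of ReaderwareVW, the name tail is made up, and it assumes you have a separate Python 3 installation to run it with. You also need to supply the location of rwuser.log yourself, since that depends on your installation.

```python
def tail(path, count=20):
    """Return the last `count` lines of a text file as a list of strings."""
    if count <= 0:
        return []
    # errors="replace" keeps the helper from failing on unexpected bytes.
    with open(path, "r", errors="replace") as handle:
        lines = handle.readlines()
    return [line.rstrip("\n") for line in lines[-count:]]
```

For example, calling tail with the path to your rwuser.log and printing each returned line shows the most recent entries, which is where an error from your script is most likely to appear.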

 


Copyright © 1999-2009 Readerware Corporation