Customizing Readerware Extraction

First off, this is not for everyone. Customizing the Readerware extraction process means getting your hands dirty and writing some scripting code to massage the extracted data. But for those of you who have experience writing scripts, this is an extremely powerful feature.

So why would you want to customize the extraction? There are a number of things you can do. One user wanted to change the titles and strip off leading A's, for example changing "A Small Deceit" to "Small Deceit". Another user didn't like Readerware extracting categories; she wanted to use her own. You can even substitute your own category for the extracted one. Someone else wanted all new books to be set to Read. All this is possible with a simple script.

Readerware calls a Python script named userexit.py after it extracts data from a web site and before it adds a book to the database. Using this script you can customize the data. You will find a basic copy of this script in your readerware\scrapers directory.

It is called userexit_sample.py; you must rename it to userexit.py to activate it.

Mac OS X users will find the scrapers folder inside the application package. Control-click on the Readerware program icon and select Show Package Contents from the popup menu. Then double-click on Contents, Resources, Java and scrapers in turn.

Here is what it looks like:

# Scraper user exit.
#
# Copyright © 1999-2009 Readerware Corporation. All Rights Reserved.
#
# To activate this sample userexit script it must be renamed to
# userexit.py. If userexit.py exists it is called immediately
# before the scraper process returns. You can change any of the
# global variables to customize the extraction process.
# More info: http://www.readerware.com/help/rwCustomExtraction.html
#
#
# This is the basic userexit.py script, it does nothing by itself.
# You can add your statements to customize the extracted data.

import string

def userextract():
    global title,author,isbn,publisher,format,first,signed,date,place
    global copies,rating,condition,category,read,pflag,eflag,value
    global comments,dateEntered,dataSource,cart,ordered
    global lccn,dewey,userNumber,copyDate,valueDate,location
    global series,pages,keywords,dimensions
    global user1,user2,user3,user4,user5,user6,user7,user8,user9,user10
    global author2,author3,author4,author5,author6
    global usedprice,usedcount,collectibleprice,collectiblecount
    global newprice,newcount,listprice,readinglevel,salesrank,available
    global buyerwaiting,editionNumber,weight,image
    global fullDateFormat,source
    global callnumber

    # Add your statements here

userextract()
By itself this script does nothing, but it is the starting point for developing your own scripts. Note the global statements. These identify the global variable names that Readerware uses; in other words, the variable "title" contains the extracted title, and so on. This is really all you need to know about how the process works: you set or change the contents of the variables to the data you want. So for example, if you don't want Readerware to extract categories from a web site, you could add the following line inside userextract(), after the "# Add your statements here" comment:
    category = ""
For something a little trickier, suppose you wanted to map the categories extracted from a web site to your own categories:
    if (string.find(category, "Mystery") != -1):
        category = "My Mystery Category"
You would need to add these kinds of statements for every category and every web site. You can probably see the basic idea: check for a string in the extracted category and, if it is found, replace the category with another. If you want to change the title as described earlier:
    if (title[0:2] == "A "):
        title = title[2:]
This may all look very strange if you have not seen it before. The script is written in the Python language. If you know Python, you're all set. If you know another scripting language like Perl, it shouldn't be much of a challenge.
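
If you end up with many category checks, a small table can keep the script manageable. The following is only a sketch: the names CATEGORY_MAP, map_category and strip_leading_article are made up for this example and are not part of Readerware, and the second table entry and the extra articles "An " and "The " are placeholders. It uses string methods rather than the string module, which should work on both older and newer Python versions.

```python
# Hypothetical mapping table: text to look for in the extracted category,
# paired with the category to store instead. Replace with your own entries.
CATEGORY_MAP = [
    ("Mystery", "My Mystery Category"),
    ("Science Fiction", "My SF Category"),  # placeholder entry
]


def map_category(category, table=CATEGORY_MAP):
    """Return the replacement for the first search text found in category."""
    for search_text, replacement in table:
        if category.find(search_text) != -1:
            return replacement
    return category  # no match: keep what was extracted


def strip_leading_article(title, articles=("A ", "An ", "The ")):
    """Remove one leading article from title, if there is one."""
    for article in articles:
        if title.startswith(article):
            return title[len(article):]
    return title


# Quick check with sample values (not real extracted data).
print(map_category("Fiction / Mystery & Detective"))
print(strip_leading_article("A Small Deceit"))
```

With these two functions placed above userextract() in userexit.py, the statements inside userextract() would be category = map_category(category) and title = strip_leading_article(title).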

Optional Fields

One use for a custom extraction script is to move optional fields to the database. Readerware extracts a number of optional fields. As there are no database columns for these optional fields, you have to store them in a user defined column if you need this data.

The process is very simple. First define a user column for the data. Select the Preferences menu item, User Columns tab. Enter the column title and check the active box for each user defined column you want to add.

Next you have to tell Readerware to move the data to this user column. You do this in a custom extraction script. For example the following line will store the reading level in the first user defined column:

    user1 = readinglevel

A sample script is provided in your readerware\scrapers directory. It is called userexit_prices.py. You need to rename this script to userexit.py. You can use it as is or modify it to meet your needs.

This script sets up the following optional fields:

user1 = usedprice
user2 = usedcount
user3 = collectibleprice
user4 = collectiblecount
user5 = newprice
user6 = newcount
user7 = listprice
user8 = salesrank
user9 = buyerwaiting

The optional fields are:

available           Item is available
buyerwaiting        Buyer waiting for item
callnumber          Library of Congress Call Number
collectiblecount    Number of collectible items available
collectibleprice    Lowest collectible price
listprice           List price
newcount            Number of new items available
newprice            Lowest new price
readinglevel        Reading Level
salesrank           Sales Rank
usedcount           Number of used items available
usedprice           Lowest Used Price
editionNumber       Edition Number
weight              Shipping Weight

If you want to extract any of this data, just move the data to a user defined column in your custom extraction script. Use the field name shown above in your script.
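
If you move several optional fields, a lookup table saves repeating the same assignment many times. This is a sketch under assumptions, not something shipped with Readerware: OPTIONAL_TO_USER and copy_optional_fields are invented names, the pairing simply repeats the userexit_prices.py layout listed earlier, and the function treats an empty or missing value as "nothing to copy".

```python
# Optional field name -> user column name, following the userexit_prices.py
# layout. Change the pairs to suit the user columns you have defined.
OPTIONAL_TO_USER = {
    "usedprice": "user1",
    "usedcount": "user2",
    "collectibleprice": "user3",
    "collectiblecount": "user4",
    "newprice": "user5",
    "newcount": "user6",
    "listprice": "user7",
    "salesrank": "user8",
    "buyerwaiting": "user9",
}


def copy_optional_fields(variables, mapping=OPTIONAL_TO_USER):
    """Copy each optional field that has a value into its user column.

    variables is a dictionary of variable names to values.
    """
    for field, column in mapping.items():
        value = variables.get(field)
        if value not in (None, ""):  # skip fields the site did not supply
            variables[column] = value
    return variables


# Quick check with made-up values standing in for extracted data.
sample = {"usedprice": "4.50", "salesrank": "", "user1": "", "user8": ""}
copy_optional_fields(sample)
print(sample["user1"])
```

Inside userextract() the call would be copy_optional_fields(globals()). That relies on Readerware reading the global variables back after the script has run, as described above, and has not been verified against Readerware itself.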

Learning Python

There are a lot of resources available on the web to help you with Python and a lot of books available too. Just fire up your browser and search for Python titles at your favorite book retailer. A good place to start your web search is at the official Python site www.python.org.

Note that you don't have to install Python. All necessary libraries are included with the Readerware distribution.

Python is a very powerful language and fairly easy to learn. If you're wondering about the name, yes, it was named after Monty Python. Unfortunately I cannot offer support on Python itself. You will need to discover the power of Python for yourself.

A book I really like is "Learning Python" by Mark Lutz; it has a very readable approach and covers both the basics and advanced topics. The "Python Pocket Reference", also by Mark Lutz, is a handy thing to keep by your keyboard. Another good one is "Text Processing in Python" by David Mertz. A friend recommends "Python Programming on Win32" by Mark Hammond, which covers Python with particular emphasis on using it with Windows.
 

Debugging Your Script

Even the best Python programmer is going to make a mistake once in a while. Fortunately it is very easy to debug your scripts with Readerware. First, start Readerware, go to General Preferences and ensure the User logging check box is checked. You must restart Readerware when you change this option.

Use Readerware as normal. When extracting data Readerware will output debugging information and any error messages to a log file, rwuser.log. You can view this file in any text editor.

Also, with user logging on, Readerware will write the HTML file it retrieved from the web site to the Readerware directory as trace.html. This can be useful sometimes when debugging scripts.
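
It can also help to try your statements outside Readerware before copying them into the scrapers folder. The harness below is a sketch under assumptions: it is meant to be run with an ordinary Python installation rather than by Readerware, and run_user_exit, SCRIPT and the sample values are invented for this example. It mimics the process described on this page by placing sample values in the script's global variables, running the script, and reading the variables back.

```python
# A cut-down user exit script held in a string. In practice you could read
# the text of your own userexit.py instead.
SCRIPT = '''
def userextract():
    global title, category
    if title[0:2] == "A ":
        title = title[2:]
    category = ""

userextract()
'''


def run_user_exit(script_source, extracted):
    """Run script_source with extracted as its global variables.

    Returns a dictionary holding the values of the same variables
    after the script has finished.
    """
    namespace = dict(extracted)      # copy so the caller's data is untouched
    exec(script_source, namespace)   # the script's globals live in namespace
    return dict((name, namespace[name]) for name in extracted)


# Sample values standing in for data extracted from a web site.
result = run_user_exit(SCRIPT, {"title": "A Small Deceit", "category": "Mystery"})
print(result["title"])
print(result["category"])
```

Once the statements behave as expected here, copy them into userextract() and check rwuser.log for errors when Readerware runs the real script.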

 


Copyright © 1999-2009 Readerware Corporation