So why would you want to customize the extraction? There are a number of things you can do. One user wanted to strip the leading "A" from titles, for example changing "A Small Deceit" to "Small Deceit". Another user didn't like Readerware extracting categories; she wanted to use her own. You can even substitute your own category for the extracted one. Someone else wanted all new books to be set to Read. All this is possible with a simple script.
Readerware calls a Python script called userexit.py after it extracts data from a web site and before it adds a book to the database. Using this script you can customize the data. You will find a basic copy of this script in your readerware\scrapers directory. It is called userexit_sample.py; you must rename it to userexit.py.
Mac OS X users will find the scrapers folder inside the application package. Control-click on the Readerware program icon. Select Show package contents from the popup menu. Double click on Contents, Resources, Java, scrapers.
Here is what it looks like:
# Scraper user exit.
#
# Copyright © 1999-2009 Readerware Corporation. All Rights Reserved.
#
# To activate this sample userexit script it must be renamed to
# userexit.py. If userexit.py exists it is called immediately
# before the scraper process returns. You can change any of the
# global variables to customize the extraction process.
# More info: http://www.readerware.com/help/rwCustomExtraction.html
#
#
# This is the basic userexit.py script, it does nothing by itself.
# You can add your statements to customize the extracted data.
import string
def userextract():
    global title,author,isbn,publisher,format,first,signed,date,place
    global copies,rating,condition,category,read,pflag,eflag,value
    global comments,dateEntered,dataSource,cart,ordered
    global lccn,dewey,userNumber,copyDate,valueDate,location
    global series,pages,keywords,dimensions
    global user1,user2,user3,user4,user5,user6,user7,user8,user9,user10
    global author2,author3,author4,author5,author6
    global usedprice,usedcount,collectibleprice,collectiblecount
    global newprice,newcount,listprice,readinglevel,salesrank,available
    global buyerwaiting,editionNumber,weight,image
    global fullDateFormat,source
    global callnumber
    # Add your statements here
userextract()
By itself this script does nothing, but it is the starting point for developing your own scripts. Note the global statements. These identify the global variable names that Readerware uses, in other words the variable "title" contains the extracted title etc. This is really all you need to know about how the process works, you need to set or change the contents of the variables to the required data. So for example, if you don't want Readerware to extract categories from a web site, you could add the following line at the end of the script:
category = ""
For something a little trickier, suppose you wanted to map the categories extracted from a web site to your own categories:
if (string.find(category, "Mystery") != -1):
    category = "My Mystery Category"
You would need to add these kinds of statements for every category and every web site. You can probably see the basic idea, check for a string in the extracted category, if found replace the category with another. If you want to change the title as described earlier:
if (title[0:2] == "A "):
    title = title[2:]
This may all look very strange, the script is written in the Python language. If you know Python, you're all set. If you know another scripting language like Perl, it shouldn't be much of a challenge.
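If you have many categories to map, repeating the if statement for each one gets tedious. The following is only a sketch of one way to keep the mappings in a single list. The names CATEGORY_MAP, map_category and strip_leading_article are made up for this illustration and are not part of Readerware, and the "Science Fiction" entry is just an example. It uses the find string method rather than string.find so that it should behave the same in older and newer versions of Python.

```python
# Hypothetical mapping from text found in an extracted category to your own
# category name. The entries are examples only; replace them with your own.
CATEGORY_MAP = [
    ("Mystery", "My Mystery Category"),
    ("Science Fiction", "My SF Category"),
]

def map_category(extracted):
    """Return your own category if a known string is found, else the original."""
    for needle, replacement in CATEGORY_MAP:
        # find() returns -1 when the string is not present
        if extracted.find(needle) != -1:
            return replacement
    return extracted

def strip_leading_article(title, articles=("A ",)):
    """Remove the first matching leading article from the title, if any."""
    for article in articles:
        if title.startswith(article):
            return title[len(article):]
    return title

# Possible use at the "# Add your statements here" marker in userexit.py:
#     category = map_category(category)
#     title = strip_leading_article(title)
```

The first matching entry in the list wins, so put more specific strings before more general ones. To strip other leading words as well, pass a longer tuple such as ("A ", "An ", "The ") as the second argument.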
The process is very simple. First define a user column for the data. Select the Preferences menu item, User Columns tab. Enter the column title and check the active box for each user defined column you want to add.
Next you have to tell Readerware to move the data to this user column. You do this in a custom extraction script. For example the following line will store the reading level in the first user defined column:
user1 = readinglevel
A sample script is provided in your readerware\scrapers directory. It is called userexit_prices.py. You need to rename this script to userexit.py. You can use it as is or modify it to meet your needs.
Mac OS X users will find the scrapers folder inside the application package. Control-click on the Readerware program icon. Select Show package contents from the popup menu. Double click on Contents, Resources, Java, scrapers.
This script sets up the following optional fields:
user1 = usedprice
user2 = usedcount
user3 = collectibleprice
user4 = collectiblecount
user5 = newprice
user6 = newcount
user7 = listprice
user8 = salesrank
user9 = buyerwaiting
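The assignments above copy each field unconditionally. If you would rather leave a user column alone when a web site did not supply a value, something like the following sketch might help. It assumes a missing value shows up as None or as a blank string, which is a guess about how Readerware passes the data and not something documented here. The helper name pick is made up for this example.

```python
def pick(extracted, current=""):
    """Return the extracted value unless it is missing or blank.

    Assumption: a missing value is None or a string containing only
    whitespace. Anything else, including the number 0, is kept.
    """
    if extracted is None:
        return current
    if str(extracted).strip() == "":
        return current
    return extracted

# Possible use at the "# Add your statements here" marker in userexit.py:
#     user1 = pick(usedprice, user1)
#     user7 = pick(listprice, user7)
```

The second argument is the value to fall back on, so passing the user column itself keeps whatever was already there.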
The optional fields are:
available        | Item is available
buyerwaiting     | Buyer waiting for item
callnumber       | Library of Congress Call Number
collectiblecount | Number of collectible items available
collectibleprice | Lowest collectible price
listprice        | List price
newcount         | Number of new items available
newprice         | Lowest new price
readinglevel     | Reading Level
salesrank        | Sales Rank
usedcount        | Number of used items available
usedprice        | Lowest Used Price
editionNumber    | Edition Number
weight           | Shipping Weight
Note that you don't have to install Python; all necessary libraries are included with the Readerware distribution.
Python is a very powerful language and fairly easy to learn. If you're wondering about the name, yes it was named after Monty. Unfortunately I cannot offer support on Python itself. You will need to discover the power of Python for yourself.
A book I really like is "Learning Python" by Mark Lutz. It has a very readable approach and covers the basics and advanced topics. The "Python Pocket Reference" by Mark Lutz is a handy thing to keep by your keyboard. Another good one is "Text Processing in Python" by David Mertz. A friend recommends "Python Programming on Win32" by Mark Hammond, which covers Python with particular emphasis on using it with Windows.
Use Readerware as normal. When extracting data Readerware will output debugging information and any error messages to a log file, rwuser.log. You can view this file in any text editor.
Also with debug on, Readerware will write the HTML file it retrieved from the web site to the Readerware directory as trace.html. This can be useful sometimes when debugging scripts.
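When rwuser.log grows long, it can be handy to look at just the last few lines after an extraction. Here is a small sketch that does this from a separate Python session, outside Readerware. The function name tail is made up for this example, and the path you pass in is up to you, since the exact location of the log file depends on your installation.

```python
def tail(path, count=20):
    """Return the last `count` lines of a text file as a list of strings."""
    if count <= 0:
        return []
    handle = open(path, "r")
    try:
        lines = handle.readlines()
    finally:
        handle.close()
    # Drop the trailing newline from each line before returning
    return [line.rstrip("\n") for line in lines[-count:]]

# Possible use, with a placeholder path you would replace with your own:
#     for line in tail("rwuser.log", 10):
#         print(line)
```

This reads the whole file into memory, which should be fine for a log of modest size.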