Configuring Oracle Secure Enterprise Search to Work With screen-scraper™


Overview

The screen-scraper™ for Oracle SES Plug-in provides a simple but flexible way to interface screen-scraper™ with Oracle Secure Enterprise Search. It is generally unnecessary to write Java code to make this link happen, and when coding is required, it is minimal.

To interface Oracle SES with screen-scraper™, it is first necessary to create a scraping session in screen-scraper™. A scraping session simply contains the instructions that tell screen-scraper™ how to extract data from a specific web site. Aside from adhering to certain naming conventions, the scraping session can be built like any other scraping session in screen-scraper™. See Creating Scraping Sessions to Work With Oracle Secure Enterprise Search for more details on this.

Once the scraping session has been created, interfacing Oracle SES with screen-scraper™ is done via the plug-in using one of two methods:

  • In cases where the scraping session can be invoked without receiving parameters passed in from an external source (e.g., a database), all configuration can be done via the standard Oracle SES web interface. This is known as a "static source".
  • If parameters are to be passed in to the scraping session dynamically (e.g., drawn from a database or external file), a small Java class must be written to retrieve the data from the external source and then invoke the scraping process. We'll refer to this type as a "dynamic source".

While screen-scraper™ traverses the target web site and extracts the required data, that data is passed in real time to Oracle SES for indexing. That is, the data gets indexed while the scraping process is still running.

Configuring Secure Enterprise Search

The first step to configuring Oracle SES to work with screen-scraper™ is to create a source type within the Oracle SES web interface. To do that, follow these steps:

  1. Pull up the administrative web interface for your Oracle SES instance and log in to it.
  2. Click on the "Global Settings" tab in the upper-right corner.
  3. Click the "Source Types" link.
  4. Click the "Create" button.
  5. Enter values in the four text boxes as you see them in the screen-shot below (the full value for the third parameter is "com.screenscraper.ses.plugin.ScreenScraperCrawlerPluginManager"):
  6. Click the "Next" button. You'll see that Oracle SES has identified several parameters you'll use to configure it to work with screen-scraper™.
  7. Click the "Finish" button.

When you create a source for this source type, you'll need to set at least a few of the parameters you saw when creating the source type. Those parameters are described in more detail below:

  • Site URL: The primary URL of the site that will be scraped by screen-scraper™ (e.g., http://www.screen-scraper.com/shop/).
  • screen-scraper host: The host name of the machine on which screen-scraper™ is running (e.g., localhost).
  • screen-scraper port: The port number on which screen-scraper™ is listening (the default is 8778).
  • Scraping session name: The name of the scraping session to be invoked (e.g., Shopping Site).
  • Session variables: A list of key/value pairs as they would appear in the query portion of a URL (e.g., foo=bar&boo=bap). The keys and values must be URL-encoded.
  • Queueable scraping session class: The name of the Java class to be invoked when using the second method (a dynamic source). An example of using this parameter is given below.
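
The "Session variables" value can be assembled by hand, but a few lines of Java make the URL-encoding requirement concrete. The following is only a sketch: the SessionVariableString class and its build method are hypothetical helpers, not part of the plug-in, and the choice of UTF-8 is an assumption. It relies on the standard java.net.URLEncoder, which uses form-style encoding (a space becomes "+").

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the plug-in: builds a "Session variables" value.
final class SessionVariableString {

    // Joins the entries as key=value pairs separated by '&', URL-encoding each
    // key and value. UTF-8 is an assumption; the charset the plug-in expects is
    // not stated in this document.
    static String build(Map<String, String> variables) {
        StringBuilder result = new StringBuilder();
        for (Map.Entry<String, String> entry : variables.entrySet()) {
            if (result.length() > 0) {
                result.append('&');
            }
            result.append(encode(entry.getKey()))
                  .append('=')
                  .append(encode(entry.getValue()));
        }
        return result.toString();
    }

    private static String encode(String text) {
        try {
            // Form-style encoding: a space becomes '+', '&' becomes %26, and so on.
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java runtime is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> variables = new LinkedHashMap<>();
        variables.put("foo", "bar");
        variables.put("boo", "bap");
        System.out.println(build(variables)); // prints foo=bar&boo=bap
    }
}
```

The resulting string can be pasted into the "Session variables" text box. A LinkedHashMap is used so the pairs appear in the order they were added.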

Setting Up a Static Source

As mentioned, there are two ways to interface Oracle SES with screen-scraper™. The first method uses only the standard Oracle SES web interface, and assumes that you won't be passing parameters to screen-scraper™ dynamically (e.g., from a database or text file). As an example, let's suppose we want to configure Oracle SES to invoke the "Shopping Site" scraping session that we created in our third tutorial, and that you may have modified in the Creating Scraping Sessions to Work With Oracle Secure Enterprise Search section. You would do this within Oracle SES by following these steps:

  1. Assuming you're logged in to the administrative web interface for Oracle SES, click on the "Sources" tab in the upper-left corner.
  2. In the source type drop-down list select "screen-scraper™ for Oracle SES", then click the "Create" button.
  3. In the next screen enter the values in the text boxes as you see them in the following screen-shot (be sure to un-check the "Start Crawling Immediately" check box):
  4. Click the "Create" button.

Let's go over what the values for these parameters indicate. For the "Site URL" we simply indicate the primary URL of the site we're scraping. The "screen-scraper host" and "screen-scraper port" parameters indicate the machine on which screen-scraper™ is running, and the port on which it's listening. We just went with the defaults in this case, but we could just as easily run screen-scraper™ on a different server and have Oracle SES access it over the network. The next parameter indicates that we want to run our "Shopping Site" scraping session. The "Session variables" parameter sets the session variables that will be sent to screen-scraper™ when the scraping session runs. You might recall from the third tutorial that we set one session variable to indicate the starting page number of search results (the "PAGE" session variable), and one to indicate the search term ("bug", in this case).

If you'd like to actually invoke this source there are a few things you'll need to do first. Assuming you've created and modified the "Shopping Site" scraping session as described in Creating Scraping Sessions to Work With Oracle Secure Enterprise Search, you'll need to make sure it's found in the instance of screen-scraper™ you'll be invoking (i.e., the one Oracle SES will be invoking). This can be done by importing the scraping session into screen-scraper™. Once it's imported, just make sure screen-scraper™ is running in server mode.
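
Before starting a crawl, it can help to confirm that something is actually listening on the host and port you entered for the source. The sketch below is a plain TCP reachability check using only the standard Java library. The ServerPortCheck class is a hypothetical helper, not part of screen-scraper™ or the plug-in, and a successful connection shows only that the port is open, not that screen-scraper™ is the program answering.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not part of screen-scraper or the plug-in.
final class ServerPortCheck {

    // Returns true if a TCP connection to host:port can be opened within the
    // timeout, false if the connection is refused, times out, or fails.
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // localhost and 8778 are the defaults mentioned in this document.
        boolean up = isListening("localhost", 8778, 2000);
        System.out.println(up
                ? "Something is listening on localhost:8778"
                : "Nothing is listening on localhost:8778");
    }
}
```

If the check fails, verify that screen-scraper™ was started in server mode and that the host and port match the values configured for the source.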

Assuming the scraping session has been imported and screen-scraper™ is running, you could now invoke the source by clicking the "Schedules" tab in the Oracle SES web interface, selecting the "Shopping Site Static" source, and clicking the "Start" button. If you then view the log file generated by Oracle SES, you should see lines containing data that was scraped by screen-scraper™ and sent to Oracle SES for indexing. Remember that you can also view screen-scraper™'s log files by looking in its "log" folder.

Setting Up a Dynamic Source

If the site to be scraped requires parameters to be passed to it from an external source (e.g., a database or text file), you'll need to set up a dynamic source. For example, it may be desirable to invoke the scraping session multiple times, passing in different search parameters to be used in the scraping process each time.

If more flexibility is needed in how screen-scraper™ gets invoked, you'll need to create a simple Java class that queues up the scraping sessions to be run. The Java class must implement the com.screenscraper.ses.scraper.QueueableScrapingSessionQueue interface (found in the ss4ses.jar file), which requires a single method:

public Set getQueueableScrapingSessions();

The Set returned by the method must hold instances of com.screenscraper.ses.scraper.QueueableScrapingSession. Consider the following Java class, which fully exemplifies the code that would need to be written:
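
As a minimal sketch of what such a class might look like (an illustration only, not the listing shipped with the plug-in), the code below queues the "Shopping Site" scraping session once per search term. Several things here are assumptions: the two types at the top are local stand-ins so the example compiles on its own, the constructor and accessors of QueueableScrapingSession are invented for illustration, the "SEARCH" variable name and the starting page value "1" are hypothetical, and the package declaration is omitted. In a real class you would import the actual types from ss4ses.jar and use whatever constructor they provide.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Mirrors the single-method interface quoted above. The real interface is
// com.screenscraper.ses.scraper.QueueableScrapingSessionQueue in ss4ses.jar.
interface QueueableScrapingSessionQueue {
    Set getQueueableScrapingSessions();
}

// STAND-IN for com.screenscraper.ses.scraper.QueueableScrapingSession.
// The constructor and accessors below are assumptions made for this sketch.
final class QueueableScrapingSession {
    private final String sessionName;
    private final Map<String, String> sessionVariables;

    QueueableScrapingSession(String sessionName, Map<String, String> sessionVariables) {
        this.sessionName = sessionName;
        this.sessionVariables = new LinkedHashMap<>(sessionVariables);
    }

    String getSessionName() { return sessionName; }

    Map<String, String> getSessionVariables() { return sessionVariables; }
}

// Queues one "Shopping Site" scraping session per search term.
class TestQueueableScrapingSessionQueue implements QueueableScrapingSessionQueue {

    // Hard-coded here; these could instead be read from a database or text file.
    private static final String[] SEARCH_TERMS = {"blade", "dvd", "bug"};

    @Override
    public Set<QueueableScrapingSession> getQueueableScrapingSessions() {
        Set<QueueableScrapingSession> sessions = new LinkedHashSet<>();
        for (String term : SEARCH_TERMS) {
            Map<String, String> variables = new LinkedHashMap<>();
            variables.put("PAGE", "1");     // assumed starting page of results
            variables.put("SEARCH", term);  // hypothetical session variable name
            sessions.add(new QueueableScrapingSession("Shopping Site", variables));
        }
        return sessions;
    }

    public static void main(String[] args) {
        // Print what would be queued, one line per scraping session.
        for (QueueableScrapingSession session
                : new TestQueueableScrapingSessionQueue().getQueueableScrapingSessions()) {
            System.out.println(session.getSessionName() + " " + session.getSessionVariables());
        }
    }
}
```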

Hopefully the comments in the source code make it obvious what's happening. We're simply creating three QueueableScrapingSession objects that will be used to invoke screen-scraper™ multiple times. That is, our "Shopping Site" scraping session will get invoked once for each of the search terms "blade", "dvd", and "bug". You'll notice that very little code specific to Oracle SES or screen-scraper™ is required. Here we're manually generating the values passed to the QueueableScrapingSession objects, but those values could just as easily be retrieved from a database, text file, or any other external source.

Creating a dynamic source within the Oracle SES web interface is identical to creating a static source (as described above), with two additions. First, you'll enter a value for the "Queueable scraping session class" parameter; in the case above you would enter "com.screenscraper.ses.plugin.test.TestQueueableScrapingSessionQueue". Second, you'll need to create a jar file containing your class and copy it into the Oracle SES plug-ins folder (as described in Installing screen-scraper™ for Oracle Secure Enterprise Search).

If you were to create a source using the class given above, and invoke it within Oracle SES, you would see that it would scrape and index data for each of the three search terms given.