Configuring Oracle Secure Enterprise Search to Work With screen-scraper™
Overview
The screen-scraper™ for Oracle SES Plug-in provides a simple but flexible way to interface screen-scraper™ with Oracle Secure Enterprise Search. Writing Java code to make this link happen is generally unnecessary, and when coding is required, it is minimal.
In order to interface Oracle SES with screen-scraper™, it is first necessary to create a scraping session in screen-scraper™. A scraping session simply contains the instructions needed to tell screen-scraper™ how to extract data from a specific web site. Aside from adhering to certain naming conventions, the scraping session can be built as any other would within screen-scraper™. See Creating Scraping Sessions to Work With Oracle Secure Enterprise Search for more details on this.
Once the scraping session has been created, interfacing Oracle SES with screen-scraper™ is done via the plug-in using one of two methods: a static source, which is set up entirely through the standard Oracle SES web interface, or a dynamic source, which uses a simple Java class to queue up the scraping sessions to be run. Both are described in their own sections below.
While screen-scraper™ traverses the target web site, extracting the required data, that data is passed in real time to Oracle SES for indexing. That is, the data gets indexed while the scraping process is running.
Configuring Secure Enterprise Search
The first step in configuring Oracle SES to work with screen-scraper™ is to create a source type within the Oracle SES web interface. To do that, follow these steps:
When you create a source for this source type, you'll need to set at least a few of the parameters you saw when creating the source type. Those parameters are described in more detail below:
Setting Up a Static Source
As mentioned, there are two ways to interface Oracle SES with screen-scraper™. The first method can be done using only the standard Oracle SES web interface, and assumes that you won't be passing parameters to screen-scraper™ dynamically (e.g., from a database or text file). As an example, let's suppose we wanted to configure Oracle SES to invoke the "Shopping Site" scraping session we created in our third tutorial, and that you may have modified in the Creating Scraping Sessions to Work With Oracle Secure Enterprise Search section. You would do this within Oracle SES by following these steps:
Let's go over what the values for these parameters indicate. For the "Site URL" we simply indicate the primary URL of the site we're scraping. The "screen-scraper host" and "screen-scraper port" parameters indicate the machine on which screen-scraper™ is running, and the port on which it's listening. We just went with the defaults in this case, but we could just as easily run screen-scraper™ on a different server and have Oracle SES access it over the network. The next parameter indicates that we want to run our "Shopping Site" scraping session. The "Session variables" parameter is used to set session variables that will be sent to screen-scraper™ when the scraping session runs. You might recall from the third tutorial that we set one session variable to indicate the starting page number of search results (the "PAGE" session variable), and one to indicate the search term ("bug", in this case).
If you'd like to actually invoke this source, there are a few things you'll need to do first. Assuming you've created and modified the "Shopping Site" scraping session as described in Creating Scraping Sessions to Work With Oracle Secure Enterprise Search, you'll need to make sure it's found in the instance of screen-scraper™ that Oracle SES will be invoking. This can be done by importing the scraping session into screen-scraper™. Once it's imported, just make sure screen-scraper™ is running in server mode.
Assuming the scraping session has been imported and screen-scraper™ is running, you can now invoke the source by clicking the "Schedules" tab in the Oracle SES web interface, selecting the "Shopping Site Static" source, and clicking the "Start" button. If you then view the log file generated by Oracle SES, you should see lines containing data that was scraped by screen-scraper™ and sent to Oracle SES for indexing. Remember that you can also view screen-scraper™'s log files by looking in its "log" folder.
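Before starting the schedule, it can help to confirm that something is actually accepting connections on the host and port entered for the "screen-scraper host" and "screen-scraper port" parameters. The following is a minimal sketch using only the standard Java library. The class name `ServerModeCheck` is hypothetical and not part of the plug-in, and the check only proves that a TCP listener exists at that address, not that it is screen-scraper™ running in server mode.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not part of screen-scraper or Oracle SES.
class ServerModeCheck {

    // Returns true if a TCP connection to host:port succeeds within timeoutMillis.
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timed out, or host unreachable.
            return false;
        }
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("usage: java ServerModeCheck <host> <port>");
            return;
        }
        String host = args[0];
        int port = Integer.parseInt(args[1]);
        System.out.println(host + ":" + port
                + (isListening(host, port, 3000) ? " is accepting connections" : " is not reachable"));
    }
}
```

Pass the same host and port values you entered for the source in Oracle SES.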
Setting Up a Dynamic Source
If the site to be scraped requires parameters to be passed to it from an external source (e.g., a database or text file), you'll need to set up a dynamic source. For example, it may be desirable to invoke the scraping session multiple times, passing in different search parameters to be used in the scraping process. If more flexibility is needed in how screen-scraper™ gets invoked, you'll need to create a simple Java class that queues up the scraping sessions to be run. The Java class must implement the com.screenscraper.ses.scraper.QueueableScrapingSessionQueue interface (found in the ss4ses.jar file), which requires a single method: public Set getQueueableScrapingSessions(); The Set returned by the method must hold instances of com.screenscraper.ses.scraper.QueueableScrapingSession. Consider the following Java class, which fully exemplifies the code that would need to be written:
Hopefully the comments in the source code make it obvious what's happening. We're simply creating three QueueableScrapingSession objects that will be used to invoke screen-scraper™ multiple times. That is, our "Shopping Site" scraping session will get invoked once for each of the search terms "blade", "dvd", and "bug". You'll notice that very little code specific to Oracle SES or screen-scraper™ is required. Here we're manually generating the values passed to the QueueableScrapingSession objects, but those values could just as easily be retrieved from a database, text file, or any other external source.
Creating a dynamic source within the Oracle SES web interface is identical to creating a static source (as described above), except that you'll enter a value for the "Queuable scraping sessions class" parameter. In the case above you would enter "com.screenscraper.ses.plugin.test.TestQueueableScrapingSessionQueue" as the value.
You'll also need to create a jar file containing your class, and copy it into the Oracle SES plug-ins folder (as described in Installing screen-scraper™ for Oracle Secure Enterprise Search). If you were to create a source using the class given above, and invoke it within Oracle SES, you would see that it would scrape and index data for each of the three search terms given.
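To make the shape of such a queue class concrete, here is a rough sketch of a class that queues the "Shopping Site" scraping session once per search term. It is an illustration under stated assumptions, not the plug-in's own sample code. The two stand-in types at the top exist only so the example compiles without ss4ses.jar; in a real plug-in you would delete them and import the types from com.screenscraper.ses.scraper instead. The constructor and `setVariable` method on `QueueableScrapingSession`, the "SEARCH" variable name, and the "PAGE" value of "1" are all assumptions. Only the interface method `getQueueableScrapingSessions()` and the "PAGE" variable name come from the text above. The package declaration is omitted for brevity.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// STAND-IN (assumption) for com.screenscraper.ses.scraper.QueueableScrapingSession.
// The real class's constructor and methods may differ; check ss4ses.jar.
class QueueableScrapingSession {
    private final String scrapingSessionName;
    private final Map<String, String> sessionVariables = new LinkedHashMap<>();

    QueueableScrapingSession(String scrapingSessionName) {
        this.scrapingSessionName = scrapingSessionName;
    }

    void setVariable(String name, String value) {
        sessionVariables.put(name, value);
    }

    String getScrapingSessionName() {
        return scrapingSessionName;
    }

    Map<String, String> getSessionVariables() {
        return sessionVariables;
    }
}

// STAND-IN for com.screenscraper.ses.scraper.QueueableScrapingSessionQueue.
// Only this single method is described in the documentation.
interface QueueableScrapingSessionQueue {
    Set getQueueableScrapingSessions();
}

class TestQueueableScrapingSessionQueue implements QueueableScrapingSessionQueue {

    // Builds one queued "Shopping Site" session per search term.
    @Override
    public Set<QueueableScrapingSession> getQueueableScrapingSessions() {
        Set<QueueableScrapingSession> sessions = new LinkedHashSet<>();
        // Hard-coded here; these could instead be read from a database or text file.
        String[] searchTerms = {"blade", "dvd", "bug"};
        for (String term : searchTerms) {
            QueueableScrapingSession session = new QueueableScrapingSession("Shopping Site");
            session.setVariable("SEARCH", term); // assumed variable name
            session.setVariable("PAGE", "1");    // assumed starting page
            sessions.add(session);
        }
        return sessions;
    }

    public static void main(String[] args) {
        // Print what would be queued, as a quick sanity check outside Oracle SES.
        for (QueueableScrapingSession session
                : new TestQueueableScrapingSessionQueue().getQueueableScrapingSessions()) {
            System.out.println(session.getScrapingSessionName() + " " + session.getSessionVariables());
        }
    }
}
```

Using a `LinkedHashSet` keeps the sessions in the order they were added, which makes the queue easier to reason about when reading the logs. Whether the plug-in itself honors that order is not stated in this document.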










