Script Scraper

Pull Data from the Internet into Excel Spreadsheets.

Baseball Tutorial

The Baseball Reference website is a popular source of information on the sport. This example will show how to get the "Team & League Standard Batting" table data into Excel format as shown below.

Downloading the Data.

default xml namespace ="";
var xml = downloadAsXML("");
To create the XPath expression needed to extract the data from the xml we will have a look at html source of the page. First use the mouse to select the start of the table in the browser and then right click and view selection source. In the source window use the find feature and search for the start of a html table '<table'.
<table data-freeze="1" class="sortable  stats_table" id="teams_standard_batting">
Now we know that we can identify that particular table by using the id attribute 'teams_standard_batting'. Looking on down the source html we see that the 'tbody' is used to hold the main data of the table and we want the rows inside the tbody.
All elements names are given in UPPER case. To check so far that we are on the correct track we can display each row of the table to the console.
default xml namespace ="";
var xml = downloadAsXML("");

var result = selectDataFromXML('//TABLE[@id="teams_standard_batting"]/TBODY/TR', xml);

for each (var TR in result.TR) {    
The console displays.
<TR xmlns="" xmlns:OG=""
xmlns:FB="" class="">
  <TD align="left">
       <A href="/teams/BAL/2011.shtml" title="Baltimore Orioles">BAL</A>
  <TD align="right">50<TD>
  <TD align="right">28.4<TD>
  <TD align="right">4.37<TD>
  <TD align="right">162<TD>
  <TD align="right">6156<TD>
  <TD align="right">5585<TD>
  <TD align="right">708<TD>
  <TD align="right">1434<TD>
We replace the for loop with code that extracts the data from the table row named TR.

This code uses the cool E4X facility that is available in the latest versions of JavaScript.

A JavaScript hashtable named 'row' holds extracted information with keys: code, name, number and age.

for each (var TR in result.TR) {    
    var row = {};
    row.Code   = TR.TD[0].A.text();
    row.Name   = TR.TD[0].A.@title.text();
    row.Number = TR.TD[1].text();
    row.Age    = TR.TD[2].text();
    out.println("Name = '" + row.Name + "' Code = " + row.Code + " Number = " + row.Number + " Age " + row.Age);
Now the console output looks much more human friendly.
Name = 'Baltimore Orioles' Code = BAL Number = 50 Age 28.4
Name = 'Boston Red Sox'    Code = BOS Number = 49 Age 30.0
Name = 'Chicago White Sox' Code = CHW Number = 43 Age 29.9
Name = 'Cleveland Indians' Code = CLE Number = 46 Age 27.8
Name = 'Detroit Tigers'    Code = DET Number = 45 Age 28.6
Name = 'Kansas City Royals' Code = KCR Number = 42 Age 25.8

Storing Data in Excel

To get this information into Excel we append each row onto the end of an array which is passed over to Excel.

var excelDataRows = [];
for each (var TR in result.TR) {    
    var row = {};
    row.Code   = TR.TD[0].A.text();
    row.Name   = TR.TD[0].A.@title.text();
    row.Number = TR.TD[1].text();
    row.Age    = TR.TD[2].text();
Now we create the ExcelWorkSheet object, providing a sheet name and a list of labels that are to be displayed on the sheet. The addRows method is passed the data array as the first argument. The second argument is a list of titles that match up with the data in each row previously.
default xml namespace ="";
var baseballSheet = new ExcelWorkSheet("Baseball", ["Team Standard Batting"]);   
baseballSheet.addRows(excelDataRows, ["Name", "Code", "Number", "Age"]);
A new ExcelSpreadSheet object is created, the sheet and the template file are passed to the constructor.
var excel = new ExcelSpreadSheet(baseballSheet, "batting_template.xls");
out.println("Wrote file baseball.xls");
Now we run the script and the new spreadsheet will created in the workspace.

We hope you enjoyed the tutorial.