Thursday, October 15, 2009

Assignment Notes – Handling and Parsing Text

This past week our assignment for ICM was to develop a simple sketch that leverages the String parsing functionality, which we reviewed during last week’s class, to capture information from sources on the internet. For my sketch I decided to create a simple viewer that displays two New York Times lists – the most searched terms and most emailed articles lists. The list is designed to follow the mouse and scroll through items when the mouse is pressed. Here is a link to functioning application that I created (make sure to look at this work using safari rather than firefox).

New York Time Most List

In the development of this sketch I limited myself to using the following string functions: indexOf() and substring(). By far the hardest part of this project was being able to parse the webpage to capture the specific bits of data that I was looking for. Below is a quick overview of the process I used to develop the code that reads and parses content from the most emailed list. This process aligns closely with the one I outlined in my class notes earlier this week.

Before I delve into the details, I want to state the importance of choosing a good source for your data. I was lucky because the pages of the New York Times are well structured, which makes the data parsing process much easier. Here is a brief overview of the process:
  1. Examining source code: once the sources had been chosen I examined the code to locate identifiers that would enable me to extract the information I was searching.
  2. Developing pseudocode: once I had found the bits of information that I was looking for in the source code I created pseudocode to walk through the algorithm for reading extracting the data from the string. Note that I had already designed the display. The pseudocode is featured below.
  3. Building the sketch: this is where I encountered the most issues in this process, which were related to my carelessness in transcribing the text required to identify the pieces of data that I wanted to capture.
Pseudo Code for Copy Animation
To read the title, author, and date for each item in the list I will create a loop that cycles 25 times (from 0 to 24). This loop will find the information associated to each article and save it into one of three data arrays. For each element on the list we will capture a title, date and category. Here is how we will parse the file within each loop:

First look for the div tag that marks the beginning of the content from each element in the list (class mpEntry). This tag features the item number, which correspond to the item’s ranking in the list. To use this tag dynamically insert a number into the string by converting the loop counter to a string and then adding it into tag.

The next data element is the date. The date is always located within a span tag that has a class of “date”. Once we found the index of this tag we need to add the length of the tag to identify the starting point of the date itself.

To find the end point of the date search for the span close tag. Make sure to set the nytEmailindexHolder variable to the end point that has just been located so that the search for the next piece of data begins at the right place

Next we will read the title. To find the start location of the title, first search for the start location of the link that precedes it - this link ends with “?em”>”. Then add 5 (the length of the “?em”>” string) to the index location just found above.

To find the end of the title, locate the link close tag that precedes the title. Make sure to set the search start location to the beginning to the title string we just captured.

Lastly, to capture the author information, find the div tag that has “byline” as its class – using the same strategy as above. We will then look for the closing of the “div” tags to set the end point of this string.

No comments: