Journal Browser

The Journal Browser was completed as the final project in my LIBR 246 course.  In deciding what to do for my project, I was inspired by the “Life after MLIS” event hosted by the American Library Association Student Chapter.  In discussing doctoral programs, one of the panelists mentioned that a primary way to choose a dissertation topic is to read widely in an area of interest to learn what work has been done and what topics are of interest in the field.  One way to do this is to browse scholarly journals in the field. However, much research today is conducted online through databases. While databases make searching easier, they often make browsing more difficult.

Process

In order to browse multiple journals in a topic of interest through online databases, one must first locate the current table of contents for each journal.  This is sometimes available on the journal’s website, but accessing it there provides no easy access point to the library’s database for reading the full article. To reach the article from the journal’s website, one must either manually adjust the link to include proxy information, or visit the library’s catalog, search for the journal title, log into the database, then find the article of interest.  A second method for browsing a recent issue is to search for the journal in the library’s catalog, log into the database, then find the most recent issue.  No matter the method, to browse multiple journals, the process must be repeated for each journal of interest.

Instead, it would be better to have a common place from which journals of interest can be browsed.  The journal browser which I built allows a user to browse the most recent issues for a set of scholarly journals.  As a proof of concept, the journal browser provides access to five scholarly journals in the field of library and information science.  The user selects a journal to browse and is presented with the list of articles from the table of contents of the most recent issue, including the article title, authors, and, when available, the abstract.  A content cloud, based on the abstracts of the articles in the issue, is also provided.  If the user clicks on one of the content tags, the browser will show only the articles with the selected topic in their abstracts. The user can easily deselect the topic to once again see all of the articles.  The user can navigate to the web page where the table of contents can be found by clicking on the volume/issue information. The user can also learn more about the journal by clicking on the journal title, which will navigate to the wiki entry for the journal at the LIS Publications Wiki.
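
The tag filtering described above can be sketched in a few lines. This is only an illustration in Python, not the application's Flex code; the Article class, the filter_by_tag function, and the sample articles are all made up for the example, and a simple case-insensitive substring match is assumed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Article:
    """Hypothetical record for one table-of-contents entry."""
    title: str
    authors: List[str]
    abstract: str = ""  # empty when the feed provides no abstract


def filter_by_tag(articles: List[Article], tag: Optional[str]) -> List[Article]:
    """Return the articles whose abstract mentions the tag.

    Passing None means no tag is selected, so every article is shown.
    """
    if tag is None:
        return list(articles)
    needle = tag.lower()
    return [a for a in articles if needle in a.abstract.lower()]


# Invented sample issue, for illustration only.
issue = [
    Article("Appraisal in practice", ["A. Author"], "How archivists approach appraisal."),
    Article("Metadata quality", ["B. Author"], "A survey of metadata quality measures."),
    Article("Editorial", ["C. Author"]),
]

selected = filter_by_tag(issue, "Archivists")  # tag clicked in the cloud
everything = filter_by_tag(issue, None)        # tag removed from selection
```

Articles without an abstract simply never match a tag, which is consistent with the cloud being built from abstracts alone.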

Project Technologies

Flex
The Journal Browser was built with Adobe Flex, a technology for creating cross-platform rich internet applications which can be viewed in Adobe Flash Player. Flex follows an asynchronous programming model, which relies on events being generated when an action completes or when something in the application changes.  I chose Flex because it provides an easy way to create a visually appealing application and because I am interested in learning more about the ways in which Flex can be used in libraries. While I have used Flex before, this is the first time that I have created an application, from start to finish, which addresses a real concern, so there were plenty of issues to work through.

Yahoo! Term Extraction
In constructing the content tag cloud, the Journal Browser makes use of Yahoo’s Term Extraction web service. I first became aware of this when I was creating my mash-up using Yahoo Pipes.  I again came across it when reading a description of the technology used in TagCloud.com (which is currently being revamped).  Rather than developing a method for stripping non-critical terms out of abstracts, I instead made use of the web service provided by Yahoo.
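
For comparison, the hand-rolled alternative (stripping non-critical terms out of an abstract) might look roughly like the Python sketch below. The stop word list and the single_word_terms function are illustrative assumptions, not part of the Journal Browser. The sketch also shows the drawback of working word by word: a phrase such as “information professional” is split into two unrelated terms.

```python
import re
from collections import Counter

# Illustrative stop word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "but", "for", "in", "of", "the", "to"}


def single_word_terms(abstract: str) -> Counter:
    """Count the words of an abstract after dropping common stop words."""
    words = re.findall(r"[a-z]+", abstract.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


terms = single_word_terms("The information professional and the archive")
# "information" and "professional" are counted separately;
# the phrase "information professional" is lost.
```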

RSS / RDF / HTML / XML
The latest issue for journals comes in a variety of formats:

  • RSS/HTML. American Archivist has a latest issue RSS feed, but the entries (in particular the description) make extensive use of HTML.  Since each piece of information isn’t its own separate attribute, the HTML must be parsed to locate information of interest, making it harder to generalize the work.
  • XML/RDF. The IFLA journal has a latest issue RSS feed which uses RDF. Each piece of information is stored in a separate attribute, making it easier to extract information from the file.

While several of the journal publishers have moved toward the RDF standard, not everyone is using the same technology for publishing journal information, which was one of the challenges in implementing this project.
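
The difference between the two feed styles can be illustrated with a small Python sketch. Both sample feeds below are invented for the example: the RDF item follows the general shape of an RSS 1.0 feed with Dublin Core elements, while the class names in the HTML description are assumptions and do not reflect the actual markup of any publisher. The point is only that the RDF item can be read field by field, whereas the HTML description needs a second, feed-specific parsing pass.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Invented sample: each piece of information has its own field.
RDF_FEED = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <item rdf:about="http://example.org/article/1">
    <title>Sample article</title>
    <dc:creator>Jane Doe</dc:creator>
    <description>Sample abstract.</description>
  </item>
</rdf:RDF>"""

# Invented sample: authors and abstract are buried in escaped HTML.
HTML_FEED = """<rss version="2.0"><channel>
  <item>
    <title>Sample article</title>
    <description>&lt;p class="authors"&gt;Jane Doe&lt;/p&gt;&lt;p class="abstract"&gt;Sample abstract.&lt;/p&gt;</description>
  </item>
</channel></rss>"""

NS = {"rss": "http://purl.org/rss/1.0/", "dc": "http://purl.org/dc/elements/1.1/"}


def parse_rdf(xml_text):
    """Read title, authors, and abstract directly from separate fields."""
    root = ET.fromstring(xml_text)
    return [
        {
            "title": item.findtext("rss:title", namespaces=NS),
            "authors": [c.text for c in item.findall("dc:creator", NS)],
            "abstract": item.findtext("rss:description", default="", namespaces=NS),
        }
        for item in root.findall("rss:item", NS)
    ]


class ClassTextParser(HTMLParser):
    """Collects the text inside each element, keyed by its class attribute."""

    def __init__(self):
        super().__init__()
        self.current = None
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        self.current = dict(attrs).get("class")

    def handle_endtag(self, tag):
        self.current = None

    def handle_data(self, data):
        if self.current:
            self.fields[self.current] = self.fields.get(self.current, "") + data


def parse_html_rss(xml_text):
    """Read the title from the feed, then dig authors and abstract out of the HTML."""
    root = ET.fromstring(xml_text)
    articles = []
    for item in root.iter("item"):
        parser = ClassTextParser()
        parser.feed(item.findtext("description", default=""))
        parser.close()
        articles.append(
            {
                "title": item.findtext("title"),
                "authors": [parser.fields.get("authors", "")],
                "abstract": parser.fields.get("abstract", ""),
            }
        )
    return articles


rdf_articles = parse_rdf(RDF_FEED)
html_articles = parse_html_rss(HTML_FEED)
```

Both functions return the same structure, but parse_html_rss only works for feeds that happen to use these class names, which is why the HTML-based approach is harder to generalize.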

Challenges

There were a number of challenges that I encountered while creating this project, ranging from issues in accessing information to issues in making use of that information.  The challenges included:

  • Lack of standards. Publishers do not provide journal information in a standard way, so the methods for processing that information cannot be generalized and must instead be tailored to each journal.  When multiple journals come from the same publisher, one processing method can serve that whole group rather than each individual journal.  Even so, the wide differences in how journal information is presented mean the Journal Browser is less automated than it could be, because a new processing method must be implemented for each journal or set of journals from a different publisher.
  • Cross-domain security. A cross-domain request is when a process in one domain (for example, foo.com) requests information from another domain (for example, bar.com).  Since Flex code is compiled into a SWF that is then run in the Flash player, it must adhere to the SWF security model, which does not allow cross-domain requests from within a SWF file unless the target server explicitly allows them. To allow such requests, a domain must host an XML policy file (crossdomain.xml) which enables access for other domains.  Unfortunately, it’s unlikely that individual Flex developers will convince companies like journal publishers to provide them with such access.  To work around this issue, one can develop a proxy program on one’s own server (where an appropriate crossdomain.xml file can be placed); the SWF calls the proxy, which requests the information from the other server and returns it to the SWF. However, this results in decreased performance because, in essence, two web requests must be made for each page request (application to proxy to journal server instead of application to journal server).
  • Asynchronous programming. As someone used to writing programs which run sequentially, it was a challenge to get used to the asynchronous model used by Flex.  In the asynchronous model, a request is started, and an event is generated when the action completes or when something in the application changes.  For example, when a request is sent to download a file such as an RSS file, the program can do other things while the file downloads, and an event is generated once the file has finished downloading.  In the Journal Browser application, this was a challenge when collecting terms for the content cloud. Generating the cloud requires the entire list of terms, so tracking had to be added to determine when all of the web service calls had completed; only then can the code that creates the content cloud be run.
  • Content cloud. There are a number of ways to determine what information should be included in a content cloud.  The content cloud is intended to provide some insight into the “aboutness” of a set of information: what is the information about, and what are the key topics?  One method for doing this is to remove common words (and, the, but, among others) and then to represent the content using the remaining words.  However, by using separate words, phrases which are meant to go together may be separated.  For example, the term “information professional” would be separated into “information” and “professional”, rather than remaining together.  To address this issue, I chose to use the Yahoo Term Extraction web service.  This web service “provides a list of significant words or phrases from a larger content”.  The web service has its own issues: it seems to be non-deterministic, in that, given the same set of information, it does not necessarily return the same results each time.  Still, it keeps phrases together, which is better than providing a list of the occurrences of each separate word.
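
The completion tracking mentioned under the asynchronous challenge can be sketched as a simple countdown. The Python below is a minimal illustration under stated assumptions, not the application's Flex code: the CloudCollector class and its method names are hypothetical, and the web service responses are simulated by calling the callback directly.

```python
from collections import Counter


class CloudCollector:
    """Gathers terms from several asynchronous calls and fires once all have reported.

    Assumes expected_calls is at least 1; with 0 calls the callback never fires.
    """

    def __init__(self, expected_calls, on_complete):
        self.pending = expected_calls   # calls that have not reported yet
        self.terms = Counter()          # term -> number of occurrences so far
        self.on_complete = on_complete  # invoked once with the full term counts

    def handle_result(self, terms):
        """Callback for one finished call; pass an empty list for a failed call."""
        self.terms.update(terms)
        self.pending -= 1
        if self.pending == 0:
            self.on_complete(self.terms)


# Simulated usage: three abstracts were sent for term extraction.
built_clouds = []
collector = CloudCollector(expected_calls=3, on_complete=built_clouds.append)

collector.handle_result(["digital preservation", "metadata"])
collector.handle_result([])            # for example, a call that returned nothing
still_waiting = len(built_clouds) == 0  # the cloud must not be built yet
collector.handle_result(["metadata"])   # last response arrives; the cloud is built
```

Counting failed or empty responses as completed is a design choice in this sketch; without it, a single failed call would keep the cloud from ever being generated.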

Future Improvements

There are a number of future improvements which could be made to the Journal Browser application to make it both more robust and more generally usable:

  • Multiple recent issues. For RSS feeds which provide access to multiple recent issues, the Journal Browser should provide the ability to see more than the most recent issue.  The user could set a preference for the number of recent issues to display.
  • Search.  Providing the ability to search within the table of contents would allow the user to find information of interest, for example an article by a particular author or a keyword which is not one of the content tags.
  • Stemming.  The current workflow results in multiple words with the same stem being listed as content tags, but a better process would be to merge words that are so closely related as to be largely the same term. For example, “archivist” and “archivists” should not both be listed as content tags, as those two terms represent the same topic.
  • Minimum threshold. The user should be allowed to control the minimum threshold for listing a tag in the content cloud.  For example, the user could choose to only list terms which occur at least three times in the abstracts of the issue.
  • Proxy configurability. The user should be able to configure the proxy used to access individual articles.  The user may wish to not use a proxy, instead relying only on previews of issues or on journals to which the user subscribes, or may wish to use a proxy different from the built-in proxy, which is the proxy for the King Library at San Jose State University.
  • Standard methods. The wide variety of ways in which journals provide information about most recent issues means that there is a large variability in the way the Journal Browser application parses and makes available the information.  Standards in the area of journal publishing and making journal information available online would do much to address this issue, making it easier to automate the process of parsing the journal information and making it available via the Journal Browser. Additionally, if the location for finding information about the latest issue for each journal were made available in a standard location on the LIS Publications Wiki, journal information could be automatically harvested from the wiki to be used by the Journal Browser application.
  • Performance.  The performance is somewhat slow when loading the issue information and content cloud.  This is partially due to the cross-domain issue described in the challenges section. The impact of this could be minimized somewhat if the RSS feeds were loaded and cached, rather than being loaded on demand.  This could take place while the application is mostly idle, such as when the user is looking at the images of journals on the startup screen, reading the about page, scrolling through a current issue, or viewing an article.
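
Two of the improvements above, stemming and a minimum threshold, could be combined when the tag counts are computed. The Python sketch below is one possible approach under stated assumptions: naive_stem is a crude plural-folding stand-in rather than a real stemming algorithm (it would mangle words such as “analysis”), and the function names and sample terms are invented for illustration.

```python
from collections import Counter


def naive_stem(term: str) -> str:
    """Crude plural folding: drop a trailing 's' (but not 'ss'). Not a real stemmer."""
    if len(term) > 3 and term.endswith("s") and not term.endswith("ss"):
        return term[:-1]
    return term


def build_cloud(terms, min_count=1):
    """Merge terms that share a stem, then keep those seen at least min_count times."""
    counts = Counter(naive_stem(t.lower()) for t in terms)
    return {term: n for term, n in counts.items() if n >= min_count}


# Invented sample terms, as if collected from the abstracts of one issue.
terms = ["archivist", "Archivists", "archivists", "access", "metadata", "metadata"]

full_cloud = build_cloud(terms)                   # no threshold
filtered_cloud = build_cloud(terms, min_count=3)  # user-chosen threshold of three
```

Applying the threshold after stemming matters: counted separately, “archivist” and “archivists” would each fall below a threshold of three, while together they meet it.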

Conclusion

This project provided a great opportunity for me to learn about Flex technology, web standards, and web services.  Technical services in libraries are always looking for new ways to give users access to information, and rich internet applications could be one engaging way to do so.  While there are a number of challenges that must be addressed, the technology is designed to encourage exploration and experimentation, making it easy to create new applications that present information in more usable and accessible ways.
