Graphs Aiding Exploratory Search

by Jakub Šimko

This project was supervised by Michal Tvarožek


Purpose of this project was twosome. Specific goal of our work was to improve speed and effectiveness of the web resource revisitation (i.e. visiting resources that were previously discovered by user in the past). In that course, we designed a method, called Search History Tree (SHT) and History Map (HM), for processing of the recorded user web search actions and transforming them to graph structures. Rather than recommencing the search, with use of these structures, user can perform lookup in his own search history faster.

In more general manner, we wanted to enrich the area of the Exploratory Search and semantic web techniques. Therefore, several techniques used by our history tool were designed with respect to their reusability and were validated separately. The first one is session identification engine used for discovering search sessions based on user’s agenda (approach useful in almost any similar history-processing application). The second one is quite non-conventional way of discovering of semantic networks of terms by mining of the logs of simple web search game.

Problem area

We identified the problem of search repetition during web revisitation. In other words, user is more likely to perform costly (in time and effort) web search, even when he is searching for a resource that he discovered before. This happens due to inefficiency of two major approaches dealing with revisitation support in today’s browsers: bookmarks and browsing history.

Out of all web resource revisitation cases (when we exclude frequently visited pages or pages where user knows their exact address) in only few cases the user can rely on his bookmark set. This is simply because of the extreme effort user would have to invest into proper organization and upkeep of his bookmark space. He would have to decide whether he will need that particular page he just found in the future (which is arguably the last thing the user thinks about while he is searching for information on the web). Secondly, he would have put effort to organize huge amount of references (that also discourages users to bookmark more often) and thirdly, he would have to perform regular cleanup to remove those references that appeared to be of no future use. Although the problem of organization of bookmarks have been solved by bookmarking portals like delicious, bookmark adding and bookmark removing decisions remain an open problem.

On the other hand, automatically recorded browsing history appears to address the “add bookmark” decision problem – it simply records reference to each visited resource. Problem here is the inefficiency of lookup. In current browser standard, if user is unable to recall the name or time when he visited the resource he seeks; scrolling the endless history list is just frustrating. Several projects tried to solve this issue by full text indexing the visited results and enabling users to search by keywords. In our solution, we also adopted that approach, but we pushed it even further.

We paid extra attention to exploratory search on both general and semantic web. Exploratory search is a search paradigm which focuses on open-ended search tasks, especially involving learning and investigating tasks as opposite to classic perception of the search as a retrieval of facts or specific resources. The biggest problem during exploratory search tasks is the so called “information space invisibility”, which means that user has to be initiative in formulating of query. He has to guess query keywords instead of picking them from the suggestion list. Visualization of the information space means offering user some kind of visible abstraction of that space, in which he can navigate and zoom. Exploratory search support applications often rely on advanced graphic tools like graphs, maps or chartings.

Solution overview

Solution overview

Solution overview.

We designed a method for web search history processing. Its purpose is to increase speed and success rate of the web revisitation process. The method automatically records user’s web search and web browsing actions, processes them and creates visual graphs representing user’s history and interests.

Search History Tree (SHT)

Actions of a search session are recorded into tree structure. Core of the tree structure (non-leaf nodes) consists of queries used during the search. The root node of the tree represents initial query. Queries are interconnected based on their order during query reformulation. Usage of “back” button during web searches results into creating multiple search branches. Web documents, visited during the search, are also present in the tree in the form of leaf nodes, attached to their respective queries.

search history tree

Schematic example of a SHT.

The tree is continuously created during the process of web browsing and is visible to the user all the time. This helps user to see his achieved results at a glance and offers guidance during complex sessions (user can interact with the tree and reload one of the previous states of the search).

However, meaningful SHTs have to contain only queries and documents related to single user need. Therefore, we need to group search actions according to different user needs: users often perform multiple searches with different goals at the same time, a single user need may span over multiple instances of search application, user may use same instance for different tasks. Therefore creating history tree out of logs of each application instance is senseless – we have to analyze semantics of the search, which is done by session identification module.

search history tree screen

SHT embedded within the faceted browser Factic.

Searching in the history

Each search history tree is stored. A full text index is created based on extracted keywords of each past search action (query or document visit) as well as from whole trees using aggregated tree characteristics. User is afterwards enabled to query the index and search items visited in the past. The search in the history is in fact commenced each time user performs regular web search (as the history module is integrated with search application). Results of the history search are documents, but with visualization of their original context in which they were retrieved. This enables user to make quick decision about the relevance of the object. If user is searching for something in his own history, he may or may not recognize a web document by name, URL or snippet, but additional context like his past query which preceded the result in the past can also make difference between “still confused” and “this was it”.

History Map

Ultimately, to organize user past search experience, we create History Map (HM) – a graph of web documents, queries and terms. The graph is synthesized out of existing history trees. Trees are reorganized a bit: the time order they represent is replaced with term composition hierarchy. We connect trees by merging identical query and document nodes. We then add special, single term nodes and connect them to the existing graph (which also creates connections between former trees). To create even more connections, we outsource the Delicious folksonomy and add more single-term nodes related to those existing. As an alternative folksonomy, we devised a special Little Google Game term network.

history map sample

Example of the history map.

history map sample screen

Piece of a History Map made of logs recorded by Adaptive Proxy. Visited document is coloured with yellow, preceding queries with cyan. Blue are directly, or indirectly related terms. Red is used for describing domain related to the document.