Instead of displaying blue links to the top relevant documents like traditional search engines, we retrieve the top-k relevant entities.
We then display these top-k entities as an integrated, structured graph of correlated entities, using advanced visualization techniques.
Our system brings previously isolated entities into one integrated, interlinked entity graph, upon which users can issue queries beyond simply typing keywords.
This is more intuitive and attractive than a plain list of top-k blue links, and it brings more meaningful, structured results with correlated entities.
There are 6 steps:
1. Top-100 document retrieval (currently using a traditional search engine API, but we could maintain an index of a large document set on our server to improve efficiency)
2. Noise Removal (ads, navigation bars, etc.; our own algorithm)
3. Entity Extraction (NLP; Stanford NER open-source tool)
4. Entity Disambiguation (e.g., merging "Bill Clinton" and "William Clinton"; currently using some simple rules)
5. Top-k relevant entity retrieval and similarity score calculation (our own algorithm)
6. Visualization (Energy Minimization Algorithm)
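The entity disambiguation step above is described only as "some simple rules", so the following is a minimal sketch of what one such rule might look like, not the system's actual logic. It assumes a tiny hand-written nickname table; the class EntityMerger, its methods, and the table contents are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Rule-based merging of person-name variants (illustrative sketch only). */
final class EntityMerger {

    // Hypothetical nickname table; a real rule set would need to be much larger.
    private static final Map<String, String> NICKNAMES = Map.of(
            "bill", "william",
            "bob", "robert",
            "jim", "james");

    /** Lower-cases the mention, drops punctuation, and expands known nicknames. */
    static String canonicalKey(String mention) {
        String cleaned = mention.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}\\s]+", " ")
                .trim();
        StringBuilder key = new StringBuilder();
        for (String token : cleaned.split("\\s+")) {
            if (key.length() > 0) {
                key.append(' ');
            }
            key.append(NICKNAMES.getOrDefault(token, token));
        }
        return key.toString();
    }

    /** Groups mentions that share a canonical key, preserving first-seen order. */
    static Map<String, List<String>> merge(List<String> mentions) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String mention : mentions) {
            groups.computeIfAbsent(canonicalKey(mention), k -> new ArrayList<>()).add(mention);
        }
        return groups;
    }

    public static void main(String[] args) {
        // "Bill Clinton" and "William Clinton" end up in the same group.
        System.out.println(merge(List.of("Bill Clinton", "William Clinton", "Hillary Clinton")));
    }
}
```

A purely string-based rule like this cannot tell apart two different people who share a name, which is one reason such rules are only a starting point.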
Front-end: ActionScript 3 (Flash)
Back-end: Java (talks to the front-end by sending XML)
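The XML format exchanged between the back-end and the front-end is not specified here. As a rough illustration only, the sketch below serializes a small entity graph with the standard Java DOM and Transformer APIs. The class name GraphXmlWriter and the element and attribute names (graph, entity, edge, id, name, source, target, score) are assumptions, not the real schema.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Serializes entities and pairwise similarity scores to XML (hypothetical schema). */
final class GraphXmlWriter {

    /** names[i] is an entity label; similarity[i][j] is the score between entities i and j. */
    static String toXml(String[] names, double[][] similarity) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element graph = doc.createElement("graph");
            doc.appendChild(graph);

            for (int i = 0; i < names.length; i++) {
                Element entity = doc.createElement("entity");
                entity.setAttribute("id", String.valueOf(i));
                entity.setAttribute("name", names[i]); // special characters are escaped on output
                graph.appendChild(entity);
            }
            // Emit each undirected edge once (i < j) and skip pairs with no similarity.
            for (int i = 0; i < names.length; i++) {
                for (int j = i + 1; j < names.length; j++) {
                    if (similarity[i][j] > 0.0) {
                        Element edge = doc.createElement("edge");
                        edge.setAttribute("source", String.valueOf(i));
                        edge.setAttribute("target", String.valueOf(j));
                        edge.setAttribute("score", String.valueOf(similarity[i][j]));
                        graph.appendChild(edge);
                    }
                }
            }

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize entity graph", e);
        }
    }

    public static void main(String[] args) {
        String[] names = {"Bill Clinton", "Hillary Clinton", "Al Gore"};
        double[][] similarity = {
                {1.0, 0.8, 0.6},
                {0.8, 1.0, 0.0},
                {0.6, 0.0, 1.0}};
        System.out.println(toXml(names, similarity));
    }
}
```

Building the document through DOM rather than string concatenation means entity names containing characters such as & or < are escaped automatically.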
Note: Currently, if the query is not yet in the cache, it may take up to 1 minute to get a response, so please be patient!
This is because we do all the work at query time (document crawling and entity extraction are somewhat time-consuming). However, if we maintained the index on the server side (and also extracted the entities beforehand), the query response time could be minimal.
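The note above mentions a query cache without describing it. The following is a minimal sketch of one way such a cache could work, assuming a small in-memory least-recently-used policy built on java.util.LinkedHashMap. The class QueryCache, its capacity parameter, and the slowPipeline function (standing in for the full crawl-and-extract pipeline) are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Small LRU cache in front of a slow query pipeline (illustrative sketch only). */
final class QueryCache {

    private final Map<String, String> cache;
    private final Function<String, String> slowPipeline;

    QueryCache(int capacity, Function<String, String> slowPipeline) {
        this.slowPipeline = slowPipeline;
        // accessOrder = true keeps entries ordered from least to most recently used.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > capacity; // evict the least recently used entry
            }
        };
    }

    /**
     * Returns the cached response for a query, computing it on a miss.
     * For simplicity the lock is held while the pipeline runs, so concurrent
     * queries wait for each other; a real server would likely avoid that.
     */
    synchronized String answer(String query) {
        String key = query.trim().toLowerCase(Locale.ROOT);
        String response = cache.get(key);
        if (response == null) {
            response = slowPipeline.apply(key);
            cache.put(key, response);
        }
        return response;
    }

    public static void main(String[] args) {
        QueryCache cache = new QueryCache(100, q -> "<graph query=\"" + q + "\"/>");
        System.out.println(cache.answer("Bill Clinton"));   // computed by the pipeline
        System.out.println(cache.answer(" bill clinton ")); // served from the cache
    }
}
```

Normalizing the key (trimming and lower-casing) lets trivially different spellings of the same query share one cache entry. Whether that is appropriate depends on how the underlying search treats case.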