At the legal publisher where I work, we face many challenges around relating pieces of content to each other and keeping content relevant. In legal content, citations establish the precedent that legal professionals use to relate cases to one another and to understand the rulings on them. I have been working through what is probably a well-known problem, which I call “inverse citation frequency”. The principle follows that of most search engines, which use inward links to a document as a mechanism for scoring its relevance. Given the citations found within a document, one would relate it to other documents that share the same citation or group of citations, and also factor in the sentiment of each case's ruling. Identifying and normalizing citations would drastically improve cross-linking between news stories and cases, between cases and other cases, and so on.
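To make the idea concrete, here is a minimal sketch of one possible reading of it, by analogy with inverse document frequency: a citation that appears in almost every document says little, while a rarely cited case is a strong signal that two documents are related. The function names and the sample data are my own illustration, the citations are assumed to be normalized already, and the sentiment of the ruling is left out entirely.

```python
import math
from collections import Counter


def inverse_citation_frequency(doc_citations):
    """Weight each citation by how rare it is across the corpus.

    doc_citations maps a document id to the set of normalized
    citation strings found in that document (hypothetical input shape).
    """
    n_docs = len(doc_citations)
    # Count the number of documents that contain each citation.
    counts = Counter(c for cites in doc_citations.values() for c in set(cites))
    return {c: math.log(n_docs / k) for c, k in counts.items()}


def shared_citation_score(doc_a, doc_b, doc_citations, icf):
    """Score how related two documents are by summing the weights
    of the citations they have in common."""
    shared = set(doc_citations[doc_a]) & set(doc_citations[doc_b])
    return sum(icf[c] for c in shared)


# Made-up corpus for illustration only.
docs = {
    "news-1": {"410 U.S. 113", "347 U.S. 483"},
    "case-2": {"410 U.S. 113"},
    "case-3": {"410 U.S. 113", "347 U.S. 483", "5 U.S. 137"},
}
icf = inverse_citation_frequency(docs)
score = shared_citation_score("news-1", "case-3", docs, icf)
```

In this toy corpus the citation that every document shares gets a weight of zero, so only the less common shared citation contributes to the score. A real system would also need the inward-link counts and the sentiment component described above.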
The key issue is normalization of the case citations. While the Bluebook, Chicago Law, and the NY Style Manual provide style guidelines for formatting citations, there are many permutations in how people actually write them. I have spent many hours handcrafting citation regular expressions and have found it to be a non-trivial exercise. Sure, companies like Lexis and West have mastered this functionality in their product lines, but those systems are locked behind proprietary walls.
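As a rough illustration of the regular expression approach, the sketch below handles only the simplest "volume reporter page" form and maps a few spelling variants onto one canonical string. The `REPORTERS` table, the pattern, and the function names are assumptions of mine for the example, not a complete or authoritative list, and the pattern will miss pin cites, parallel citations, short forms, and much else.

```python
import re

# Hypothetical alias table: reporter spelling with punctuation and
# spaces stripped -> canonical abbreviation. Far from complete.
REPORTERS = {
    "us": "U.S.",
    "sct": "S. Ct.",
    "f2d": "F.2d",
    "f3d": "F.3d",
    "fsupp": "F. Supp.",
}

# volume, then a loosely matched reporter, then the first page.
CITATION_RE = re.compile(
    r"\b(?P<volume>\d{1,4})\s+"
    r"(?P<reporter>[A-Za-z][A-Za-z0-9.\s]{0,20}?)\s*"
    r"(?P<page>\d{1,5})\b"
)


def squash(reporter):
    """Lowercase and drop everything except letters and digits,
    so 'U. S.', 'U.S.' and 'US' all become 'us'."""
    return re.sub(r"[^a-z0-9]", "", reporter.lower())


def extract_citations(text):
    """Return canonical 'volume reporter page' strings for every
    candidate whose reporter is in the alias table."""
    found = []
    for m in CITATION_RE.finditer(text):
        canonical = REPORTERS.get(squash(m.group("reporter")))
        if canonical:  # unknown reporters are treated as false positives
            found.append(f"{m.group('volume')} {canonical} {m.group('page')}")
    return found


citations = extract_citations("See Roe v. Wade, 410 US 113, 93 S.Ct. 705 (1973).")
# citations == ["410 U.S. 113", "93 S. Ct. 705"]
```

Matching loosely and then filtering against a lookup table keeps the pattern itself short, but every new reporter or formatting habit still means another table entry or another special case, which is exactly why this gets painful to maintain by hand.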
Anybody have any thoughts on this?