Text Indexes
============

Text indexes combine an inverted index and a lexicon to support text
indexing and searching.  A text index can be created without passing
any arguments:

>>> from zope.index.text.textindex import TextIndex
>>> index = TextIndex()

By default, it uses an "Okapi" inverted index and a lexicon with a
pipeline consisting of a simple word splitter, a case normalizer, and
a stop-word remover.

We index text using the `index_doc` method:

>>> index.index_doc(1, u"the quick brown fox jumps over the lazy dog")
>>> index.index_doc(2,
...     u"the brown fox and the yellow fox don't need the retriever")
>>> index.index_doc(3, u"""
... The Conservation Pledge
... =======================
...
... I give my pledge, as an American, to save, and faithfully
... to defend from waste, the natural resources of my Country;
... it's soils, minerals, forests, waters and wildlife.
... """)
>>> index.index_doc(4, u"Fran\xe7ois")
>>> word = (
...     u"\N{GREEK SMALL LETTER DELTA}"
...     u"\N{GREEK SMALL LETTER EPSILON}"
...     u"\N{GREEK SMALL LETTER LAMDA}"
...     u"\N{GREEK SMALL LETTER TAU}"
...     u"\N{GREEK SMALL LETTER ALPHA}"
...     )
>>> index.index_doc(5, word + u"\N{EM DASH}\N{GREEK SMALL LETTER ALPHA}")
>>> index.index_doc(6, u"""
... What we have here, is a failure to communicate.
... """)
>>> index.index_doc(7, u"""
... Hold on to your butts!
... """)
>>> index.index_doc(8, u"""
... The Zen of Python, by Tim Peters
...
... Beautiful is better than ugly.
... Explicit is better than implicit.
... Simple is better than complex.
... Complex is better than complicated.
... Flat is better than nested.
... Sparse is better than dense.
... Readability counts.
... Special cases aren't special enough to break the rules.
... Although practicality beats purity.
... Errors should never pass silently.
... Unless explicitly silenced.
... In the face of ambiguity, refuse the temptation to guess.
... There should be one-- and preferably only one --obvious way to do it.
... Although that way may not be obvious at first unless you're Dutch.
... Now is better than never.
... Although never is often better than *right* now.
... If the implementation is hard to explain, it's a bad idea.
... If the implementation is easy to explain, it may be a good idea.
... Namespaces are one honking great idea -- let's do more of those!
... """)

Then we can search using the apply method, which takes a search
string:

>>> [(k, "%.4f" % v) for (k, v) in index.apply(u'brown fox').items()]
[(1, '0.6153'), (2, '0.6734')]

>>> [(k, "%.4f" % v) for (k, v) in index.apply(u'quick fox').items()]
[(1, '0.6153')]

>>> [(k, "%.4f" % v) for (k, v) in index.apply(u'brown python').items()]
[]

>>> [(k, "%.4f" % v) for (k, v) in index.apply(u'dalmatian').items()]
[]

>>> [(k, "%.4f" % v) for (k, v) in index.apply(u'brown or python').items()]
[(1, '0.2602'), (2, '0.2529'), (8, '0.0934')]

>>> [(k, "%.4f" % v) for (k, v) in index.apply(u'butts').items()]
[(7, '0.6948')]

The outputs are mappings from document ids to floating-point scores.
Items with higher scores are more relevant.

We can use unicode characters in search strings:

>>> [(k, "%.4f" % v) for (k, v) in index.apply(u"Fran\xe7ois").items()]
[(4, '0.7427')]

>>> [(k, "%.4f" % v) for (k, v) in index.apply(word).items()]
[(5, '0.7179')]

We can use globbing in search strings:

>>> [(k, "%.3f" % v) for (k, v) in index.apply('fo*').items()]
[(1, '2.179'), (2, '2.651'), (3, '2.041')]

Text indexes support basic statistics:

>>> index.documentCount()
8
>>> index.wordCount()
114
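
The default lexicon and Okapi index can also be passed to the
constructor explicitly, which is the hook for customizing the
pipeline.  The following is a minimal sketch; it assumes the pipeline
elements live in `zope.index.text.lexicon`, the ranking index in
`zope.index.text.okapiindex`, and that `TextIndex` accepts a lexicon
and an index as its two optional arguments:

>>> # Assumed locations of the default pipeline components:
>>> from zope.index.text.lexicon import Lexicon
>>> from zope.index.text.lexicon import Splitter, CaseNormalizer, StopWordRemover
>>> from zope.index.text.okapiindex import OkapiIndex
>>> lexicon = Lexicon(Splitter(), CaseNormalizer(), StopWordRemover())
>>> explicit = TextIndex(lexicon, OkapiIndex(lexicon))
>>> explicit.documentCount()
0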
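
Documents can be removed with the `unindex_doc` method, part of the
IInjection interface that the index implements.  The example below is
a sketch under that assumption: after removing document 7, it no
longer shows up in results and the document count drops:

>>> index.unindex_doc(7)  # assumes IInjection.unindex_doc
>>> [(k, "%.4f" % v) for (k, v) in index.apply(u'butts').items()]
[]
>>> index.documentCount()
7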
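
Finally, the whole index can be emptied in one step.  This sketch
assumes the `clear` method from the same IInjection interface resets
the index to its initial, empty state:

>>> index.clear()  # assumed to discard all indexed documents
>>> index.documentCount()
0
>>> [(k, "%.4f" % v) for (k, v) in index.apply(u'brown fox').items()]
[]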