28.01.2013 Views

SAP HANA Developer Guide - Get a Free Blog


Fuzzy search can be used in various applications, for example:

● Fault-tolerant search in text columns (for example, HTML or PDF): Search for documents on 'Driethanolamyn' and find all documents that contain the term 'Triethanolamine'.

● Fault-tolerant search in structured database content: Search for a product called 'coffe krisp biscuit' and find 'Toffee Crisp Biscuits'.

● Fault-tolerant check for duplicate records: Before creating a new customer record in a CRM system, search for similar customer records and verify that there are no duplicates already stored in the system. For example, when creating a new record 'SAB Aktiengesellschaft & Co KG Deutschl.' in 'Wahldorf', the system should bring up 'SAP Deutschland AG & Co. KG' in 'Walldorf' as a possible duplicate.

You can call the fuzzy search by using the CONTAINS predicate with the FUZZY option in the WHERE clause of a SELECT statement.

Fuzzy Score

SELECT * FROM <tablename>
WHERE CONTAINS (<column_name>, <search_string>, FUZZY (0.8))
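As a concrete illustration of the statement form above, the following sketch searches for the misspelled product name from the earlier example. The table name PRODUCTS and the column name PRODUCT_NAME are assumptions made for illustration and do not come from the guide.

```sql
-- Hypothetical table and column names, for illustration only.
-- A row qualifies when its PRODUCT_NAME reaches a fuzzy score
-- of at least 0.8 against the search string.
SELECT *
  FROM products
 WHERE CONTAINS (product_name, 'coffe krisp biscuit', FUZZY (0.8));
```

Whether a particular row is returned depends on the threshold chosen. The value 0.8 is simply the one used in the template, and a lower value makes the search more tolerant.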

Note: You can improve the performance of a fuzzy search on a text column by defining a fuzzy index on the column. You can define a fuzzy index with the option FUZZY SEARCH INDEX ON in the CREATE FULLTEXT INDEX statement or in the definition of a column of data type TEXT.
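The two ways of defining a fuzzy index mentioned in the note might look as follows. This is a minimal sketch: all table, column, and index names are hypothetical, and any further full-text index parameters your system requires are omitted.

```sql
-- Variant 1: request a fuzzy search index when creating a full-text
-- index on an existing column (hypothetical names).
CREATE FULLTEXT INDEX idx_product_desc ON products (description)
  FUZZY SEARCH INDEX ON;

-- Variant 2: request the fuzzy search index directly in the
-- definition of a TEXT column (hypothetical names).
CREATE COLUMN TABLE documents (
  id          INTEGER PRIMARY KEY,
  doc_content TEXT FUZZY SEARCH INDEX ON
);
```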

The fuzzy search algorithm calculates a fuzzy score for each string comparison. The higher the score, the more similar the strings are. A score of 1.0 means the strings are identical. A score of 0.0 means the strings have nothing in common.

You can request the score in the SELECT statement by using the SCORE() function. You can sort the results of a query by score in descending order to get the best records first (the best record is the one most similar to the user input). When a SELECT statement runs a fuzzy search on multiple columns, the score returned is the average of the scores of all columns used.
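The use of SCORE() with a search on more than one column can be sketched with the duplicate-check scenario described earlier. The CUSTOMERS table and its column names are assumptions for illustration only.

```sql
-- Hypothetical table and column names.
-- SCORE() returns one value per row. With two CONTAINS predicates
-- it is the average of the two column scores, as described above.
SELECT SCORE() AS score, *
  FROM customers
 WHERE CONTAINS (company_name, 'SAB Aktiengesellschaft & Co KG Deutschl.', FUZZY (0.8))
   AND CONTAINS (city, 'Wahldorf', FUZZY (0.8))
 ORDER BY score DESC;  -- most similar records first
```

Sorting by the score alias in descending order puts the most likely duplicate at the top of the result set, so an application can show only the first few rows to the user.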

When searching text columns, a TF/IDF (term frequency/inverse document frequency) score is returned by default instead of the fuzzy score. The fuzzy score influences the TF/IDF calculation, but keep in mind that, with TF/IDF, the range of score values returned is normalized to the interval between 0.0 and 1.0, and the best record always gets a score of 1.0, regardless of its fuzzy score.

The TF/IDF calculation can be disabled so that you get the fuzzy score instead. In particular, this makes sense for short-text columns containing data such as product names or company names. On the other hand, you should use TF/IDF for long-text columns containing data such as product descriptions, HTML data, or Word and PDF documents.
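The passage above does not name the setting that disables TF/IDF. The sketch below assumes the fuzzy search option textSearch=compare, passed as the second argument of FUZZY; check that this option is available in your release before relying on it. Table and column names are hypothetical.

```sql
-- Assumption: the search option 'textSearch=compare' switches a text
-- column from TF/IDF scoring to the plain fuzzy score.
-- PRODUCT_NAME_TEXT is a hypothetical short-text column.
SELECT SCORE() AS score, *
  FROM products
 WHERE CONTAINS (product_name_text, 'coffe krisp biscuit',
                 FUZZY (0.8, 'textSearch=compare'))
 ORDER BY score DESC;
```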

Option spellCheckFactor

There are two use cases for the option spellCheckFactor.

● A) This option allows you to set the score for terms that are not fully equal but that would be a 100% match because of the internal character standardization used by the fuzzy search.

For example, the terms 'Café' and 'cafe' give a score of 1.0 although the terms are not equal. For some users it may be necessary to distinguish between both terms.
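Use case A can be sketched as follows, assuming the option is passed in the option string of FUZZY like other search options. The table and column names are hypothetical, and the value 0.9 is chosen only for illustration.

```sql
-- Hypothetical table and column names.
-- With spellCheckFactor below 1.0, terms that match only after
-- character standardization (such as 'Café' and 'cafe') should
-- score lower than a fully identical term.
SELECT SCORE() AS score, *
  FROM venues
 WHERE CONTAINS (venue_name, 'cafe', FUZZY (0.8, 'spellCheckFactor=0.9'))
 ORDER BY score DESC;
```

With such a setting, sorting by score would be expected to list exact matches ahead of rows that differ only in case or accents.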

SAP HANA Developer Guide

Enabling Search

PUBLIC

© 2012 SAP AG. All rights reserved.
