MySQL: fuzzy searching of fulltext queries

Introduction:
Whenever a user-search on a constrained set of rows comes up empty, it is desirable, from a user-experience point-of-view, to re-perform the search in a fuzzy manner. That is, allowing the search-term to match with a certain error and bias. After all, users have grown accustomed to technology that assists rather than hinders.
Searching for the German word Kompetenz, instantly returns a table-result.
Performing a google-search for "mysql fuzzy search", brought up this stackoverflow question: How do I do a fuzzy match of company names in MYSQL with PHP for auto-complete? Among the results were also similar, yet unfinished attempts.
Frequently pointed out is the  Levenshtein distance, which in combination with a certain cutoff score, provides a great way of fuzzy-matching strings. Unfortunately, MySQL >5.x does not provide a Levenshtein function. There is the option of stored procedures, even compiling libs and binding them at runtime. Procedures will require the database-user to have the necessary GRANT PRIVILEGES ON USER, and on cheap or free LAMP (Linux Apache MySql PHP) hosts, this is generally not never the case.

Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English.


MySQL does however provide a SOUNDEX function natively, which can return a result not limited in length, based upon the original Soundex implementation developed by Robert C. Russell and Margaret K. Odell - and patented in 1918, almost a century ago. The original implementation discards vowels first and duplicates second. It is generally the other way around in subsequent implementations. Examples for MySQL SOUNDEX are provided below.

The Soundex algorithm in brief:

  • Retain the first letter of a word
  • Replace similar sounding sets of consonants through digits. sets are numbered between 1 and 6 e.g. {b, f, p, v} := 1
  • Include English language idiosyncrasies: e.g. h, w between the same digits, retains the first digit only 
  • Stops after a maximum of three digits (This is not the case in MySQL Soundex. but true for PHP Soundex)

Implementation



Description:

SUBSTRING is used to discard the first letter, thus constraining the search to numbers only. Using numbers only can be used to one's advantage when speed is of the essence, by storing a NUMERIC field in a temporary table or as an extra field in the data-table, which only contains SOUNDEXed digits of the original text-field, that is to be searched. MySQL may then optimize the search.

CONCAT  is used to concatenate strings, as MySQL lacks a string concatenation operator. The output of CONCAT serves as the input for LIKE function. The search term, which was reduced to a SOUNDEX number, is attempted to be matched anywhere within the SOUNDEX-reduced text field.
As such the algorithm is fairly easy and does have some quirks: Unlike natural language, which is separated by spaces into words and thus serve as a reference frame, the same does not hold true in this scenario. In the algorithm described all spaces are disregarded. Additionally SOUNDEX is optimized for the English language, meaning that consonants that sound similar in English are grouped and assigned a number and that Unicode language-alphabets are generally not supported.

Nonetheless, SOUNDEX is fast, and the algorithm works fairly well for typical scenarios such as small tables in Customer Relationship Management and Content Management Systems. An example can be seen in the screenshot, showing a result being returned in English despite the similar sounding search term being in German.


Notes:

  • SOUNDEX results often have a character size of four, since everything beyond three digits is cut off.
  • MySQL 5 brings with it a Fuzzy searching function called MATCH AGAINST. The function allows weights to be assigned to words, and summarized scores to be ultimately ordered by relevance of a given Fulltext field entry. A FULLTEXT index is advisable, which rules out using MATCH AGAINST out for large, monolithic databases.
  • The presented scheme through SOUNDEX and LIKE is already Map-Reduce-able.


LihatTutupKomentar