Recipe 9: The Soundex algorithm


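SOUNDEX returns a four-character code (the SOUNDEX code) used to assess how similar two strings sound. The first character of the code is the first letter of the input string; the second through fourth characters are digits.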

The soundex code is as follows:

    def soundex(name, len=4):
        """ soundex module conforming to Knuth's algorithm
            implementation 2000-12-24 by Gregory Jorgensen
            public domain
        """
        # digits holds the soundex values for the alphabet (A..Z)
        digits = '01230120022455012623010202'
        sndx = ''
        fc = ''
        # translate alpha chars in name to soundex digits
        for c in name.upper():
            if c.isalpha():
                if not fc:
                    fc = c  # remember first letter
                d = digits[ord(c) - ord('A')]
                # duplicate consecutive soundex digits are skipped
                if not sndx or (d != sndx[-1]):
                    sndx += d
                # print(sndx)  # debug: show the digit string built so far
        # replace first digit with first alpha character
        sndx = fc + sndx[1:]
        # remove all 0s from the soundex code
        sndx = sndx.replace('0', '')
        # return soundex code padded to len characters
        return (sndx + (len * '0'))[:len]

Note that the code is designed for English names: the digits table only covers the letters A to Z.
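To illustrate how the codes can be used to compare two strings, here is a minimal sketch. It includes a compact Python 3 port of the recipe's function so that it runs on its own. The parameter name `length`, the A-Z range check, and the helper `sounds_alike` are assumptions made for this example and are not part of the original recipe.

```python
def soundex(name, length=4):
    """Compact Python 3 port of the recipe's soundex function."""
    digits = '01230120022455012623010202'  # soundex values for A..Z
    sndx = ''
    fc = ''
    for c in name.upper():
        # Assumption: ignore anything outside A-Z instead of using isalpha()
        if 'A' <= c <= 'Z':
            if not fc:
                fc = c  # remember first letter
            d = digits[ord(c) - ord('A')]
            # skip duplicate consecutive digits
            if not sndx or d != sndx[-1]:
                sndx += d
    sndx = fc + sndx[1:]          # first digit becomes the first letter
    sndx = sndx.replace('0', '')  # drop the zeros (vowels and similar)
    return (sndx + length * '0')[:length]


def sounds_alike(a, b):
    """Hypothetical helper: True when both names share a soundex code."""
    return soundex(a) == soundex(b)


print(soundex("Robert"))                # R163
print(soundex("Rupert"))                # R163
print(sounds_alike("Smith", "Smyth"))   # True
print(sounds_alike("Robert", "Lee"))    # False
```

Two names are treated as similar when their codes are equal, so "Robert" and "Rupert" match while "Robert" and "Lee" do not.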

