First of all, thanks for the awesome library! 💯
I have a couple of questions. It would be a great help if you could answer them.
- How is the tokenization done? From what I have seen in the code, it is based on whitespace. Is there a way to direct the scorer to split camel-case tokens? For example, the string `MyDocuments` would be tokenized to `["My", " Documents"]`.
- I do not see any parameter to direct the scorer to score in a case-insensitive manner. Is that not possible, or am I missing something?
  Below are the scores for two pairs of strings, `(mysmilarstring, MyawfullySimilarStirng)` and `(mysmilarstring, myawfullysimilarstirng)`. The scores differ between the two pairs even though the strings differ only in letter casing.
-------------------------------FuzzySharp-------------------------------------------------
mysmilarstring ||MyawfullySimilarStirng || Ratio = 56
mysmilarstring ||MyawfullySimilarStirng || PartialRatio = 71
mysmilarstring ||MyawfullySimilarStirng || TokenSortRatio = 56
mysmilarstring ||MyawfullySimilarStirng || PartialTokenSortRatio = 71
mysmilarstring ||MyawfullySimilarStirng || TokenSetRatio = 56
mysmilarstring ||MyawfullySimilarStirng || PartialTokenSetRatio = 71
mysmilarstring ||MyawfullySimilarStirng || TokenInitialismRatio = 0
mysmilarstring ||MyawfullySimilarStirng || PartialTokenInitialismRatio = 0
mysmilarstring ||MyawfullySimilarStirng || WeightedRatio = 64
-------------------------------FuzzySharp-------------------------------------------------
mysmilarstring ||myawfullysimilarstirng || Ratio = 72
mysmilarstring ||myawfullysimilarstirng || PartialRatio = 86
mysmilarstring ||myawfullysimilarstirng || TokenSortRatio = 72
mysmilarstring ||myawfullysimilarstirng || PartialTokenSortRatio = 86
mysmilarstring ||myawfullysimilarstirng || TokenSetRatio = 72
mysmilarstring ||myawfullysimilarstirng || PartialTokenSetRatio = 86
mysmilarstring ||myawfullysimilarstirng || TokenInitialismRatio = 0
mysmilarstring ||myawfullysimilarstirng || PartialTokenInitialismRatio = 0
mysmilarstring ||myawfullysimilarstirng || WeightedRatio = 77
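
To make the two requests concrete, here is a minimal sketch of the preprocessing I have in mind. It is written in Python purely for illustration and is not FuzzySharp code: `split_tokens`, `fold_case`, and `simple_ratio` are hypothetical helpers, and `simple_ratio` uses Python's `difflib`, so its numbers will not necessarily match the FuzzySharp scores above.

```python
import re
from difflib import SequenceMatcher

# Zero-width boundaries inside a camel-case word:
#   lower/digit followed by upper  -> "My|Documents"
#   upper followed by upper+lower  -> "HTTP|Server"
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")


def split_tokens(text):
    """Hypothetical tokenizer: split on whitespace, then on camel-case boundaries."""
    tokens = []
    for chunk in text.split():
        tokens.extend(part for part in _CAMEL_BOUNDARY.split(chunk) if part)
    return tokens


def fold_case(text):
    """Hypothetical normalizer: lower-case the input so casing cannot affect the score."""
    return text.lower()


def simple_ratio(a, b):
    """Illustrative 0-100 similarity based on difflib, not FuzzySharp's scorer."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


if __name__ == "__main__":
    print(split_tokens("MyDocuments"))
    # After case folding, both pairs from the issue reduce to the same comparison.
    query = "mysmilarstring"
    for candidate in ("MyawfullySimilarStirng", "myawfullysimilarstirng"):
        print(candidate, simple_ratio(fold_case(query), fold_case(candidate)))
```

With something like this applied before scoring, both pairs would receive the same score, which is the behaviour I would expect from a case-insensitive option.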