I need to compare two strings and get a confidence score on how similar they are. I have around half a million of such strings to compare.While Fuzzy look up sample codes I used compare a string against all the records in the reference table,my comparison requirement is limited to two given strings.
Can some body please help me on how to do this using fuzzy look up?
Thank you very much!Do you want to compare all the source strings with one string in the
lookup table? If that is the case then can you not use a view of the
one reference column as the lookup and not the table object? If I have
misinterpreted the requirement then please let me know.
Allan
"Ashraj@.discussions.microsoft.com"
wrote in message
news:046a46c8-8fa8-4399-9417-24493b01d9d0@.discussions.microsoft.com:
> I need to compare two strings and get a confidence score on how similar
> they are. I have around half a million of such strings to compare.While
> Fuzzy look up sample codes I used compare a string against all the
> records in the reference table,my comparison requirement is limited to
> two given strings.
>
>
> Can some body please help me on how to do this using fuzzy look up?
> Thank you very much!|||Hi,
Thanks much for your input.Sorry I wasn't clear.
Actually, I have two files . FILEA and FILEB. They have have similar structure with two fields Accountnumber(unique key) and Address(String value) . FileB is the reference file.FileA is input file.Both have same number of records with matching key fields.
For each account number in FILEA,the Address string has to be compared to address string in FILEB with the same account number.
For example Address in first row "45 Sunset blvd" with acct number 111 ,needs to be compared with address of account 111 in FILEB which is "456 Sunset Blvd " and a confidance and similarity score needs to be registered .
The Key field account number is only for linking purpose which may not be required if I have a way to pass two strings and get the comparison results.
FileA
Acct
Address
111
45 Sunset blvd
555
236 technical str
999
448
FileB
Acct
Address
111
456 Sunset Blvd
555
236 Technical st.
999
449
Looking forward to any help that can be provided.Please feel free to ask for any clarifications.Thanks much|||Ok So I have this
CREATE TABLE AccountsInput(AcctNUm int, Street varchar(50))
CREATE TABLE AccountsLookup(AcctNUm int, Street varchar(50))
Insert AccountsInput VALUES(111,'45 Sunset Boulevard')
Insert AccountsInput VALUES(555,'236 technical str')
Insert AccountsInput VALUES(999,'448 Madeira')
Insert AccountsLookup VALUES(111,'45 Sunset BLVD')
Insert AccountsLookup VALUES(555,'236 technical st.')
Insert AccountsLookup VALUES(999,'448 Madera')
I want to compare both the acct# and the address to get matches or not.
I configure the fuzzy lookup to match on both the attributes and I get
this.
AcctNUmStreetLooked Up account NumLooked Up Street
Name_Similarity_Confidence
11145 Sunset Boulevard11145 Sunset BLVD0.74537040.9875
555236 technical str555236 technical st.0.88491890.9875
999448 Madeira999448 Madera0.92811570.9875
Is this not what you want ?
Allan
"Ashraj@.discussions.microsoft.com"
wrote in message
news:74b34ed7-c2df-4569-b8bd-1d5e4ed19b4f@.discussions.microsoft.com:
> Hi,
> Thanks much for your input.Sorry I wasn't clear.
>
> Actually, I have two files . FILEA and FILEB. They have have similar
> structure with two fields Accountnumber(unique key) and Address(String
> value) . FileB is the reference file.FileA is input file.Both have same
> number of records with matching key fields.
>
> For each account number in FILEA,the Address string has to be compared
> to address string in FILEB with the same account number.
>
>
> For example Address in first row "45 Sunset blvd" with acct number 111
> ,needs to be compared with address of account 111 in FILEB which is
> "456 Sunset Blvd " and a confidance and similarity score needs to be
> registered .
>
> The Key field account number is only for linking purpose which may not
> be required if I have a way to pass two strings and get the comparison
> results.
>
>
>
> FileA
>
>
>
> Acct
>
> Address
>
> 111
>
> 45 Sunset blvd
>
> 555
>
> 236 technical str
>
> 999
>
> 448 madera
>
>
>
>
>
> FileB
>
>
>
> Acct
>
> Address
>
> 111
>
> 456 Sunset Blvd
>
> 555
>
> 236 Technical st.
>
> 999
>
> 449 Madera
>
>
>
> Looking forward to any help that can be provided.Please feel free to ask
> for any clarifications.Thanks much|||Thanks Alan. It looks almost perfect .
I just wanted to confirm one more thing before I say this is what I exactly want.
For our comparison, I have the following basic assumption that
a) account numbers are not corrupt and all account numbers in FILEA are in FILEB and viceversa.
b) FILEA and FILEB are unique keyed with account number
As you had detailed, I want address of account number 111 in FILEA to be compared to address of account number 111 ONLY in FILEB.
The address comparison should be made ONLY between the records that have same account number in FILEA and FILEB .Comparison and statistics should not be made between address that don't have common account number.
For example, Address of account number 111 SHOULD NOT be compared with address of account number 555 and 999.
In our example for first record it SHOULD NOT execute 3 searches and give the following.
111 45 Sunset Boulevard 111 45 Sunset BLVD 0.7453704 0.9875
--
111 45 Sunset Boulevard 555 236 technical st. 0.1xxxxxx 0.xxxxx
111 45 Sunset Boulevard 999 448 Madera 0.9281157 0.9875
It should do ONLY do comparison one time and give only the following:-
111 45 Sunset Boulevard 111 45 Sunset BLVD 0.7453704 0.9875
Since account numbers are assumed to be clean and same in both files,if we don't include the account number match similarity and confidence and have the confidence and similarity for the address alone that would be perfect.
So the idea is basically to compare the two addresses .Account number only helps to link the input file address with the reference file address.Instead of account number,if we can compare the address sequentially (like first record of input file to be compared with first record of reference only) then that is fine too.
This way,each address of FILEA need NOT be compared with each address of reference file which will be huge system consuming and is also not required.
Thanks for your patience and please do feel free to make me clarify further.
|||The CROSS PRODUCT matching you describe will not occur or at least it
will not be visible to you I think. Thinking about this I do not know
if the engine would consider matching (It is fuzzy after all) but
because of the so way out probability and confidence scores it would not
get further than a distant thought.
To that end I changed the input data slightly
CREATE TABLE AccountsInput(AcctNUm int, Street varchar(50))
CREATE TABLE AccountsLookup(AcctNUm int, Street varchar(50))
Insert AccountsInput VALUES(111,'45 Sunset Boulevard')
Insert AccountsInput VALUES(555,'236 technical str')
Insert AccountsInput VALUES(999,'448 Madeira')
Insert AccountsLookup VALUES(111,'45 Sunset BLVD')
Insert AccountsLookup VALUES(555,'236 technical st.')
Insert AccountsLookup VALUES(999,'448 Madera')
Insert AccountsLookup VALUES(996,'442 Madeira')--added very close to 999
I still only matched on the same rows.
Allam
"Ashraj@.discussions.microsoft.com"
wrote in message
news:5b707de0-1b00-40d3-9a4b-4868791a245f@.discussions.microsoft.com:
> Thanks Alan. It looks almost perfect .
> I just wanted to confirm one more thing before I say this is what I
> exactly want.
>
> For our comparison, I have the following basic assumption that
> a) account numbers are not corrupt and all account numbers in FILEA are
> in FILEB and viceversa.
> b) FILEA and FILEB are unique keyed with account number
>
> As you had detailed, I want address of account number 111 in FILEA to
> be compared to address of account number 111 ONLY in FILEB.
>
> The address comparison should be made ONLY between the records that have
> same account number in FILEA and FILEB .Comparison and statistics should
> not be made between address that don't have common account number.
>
>
> For example, Address of account number 111 SHOULD NOT be compared with
> address of account number 555 and 999.
>
> In our example for first record it SHOULD NOT execute 3 searches and
> give the following.
> 111 45 Sunset Boulevard 111 45 Sunset BLVD 0.7453704
> 0.9875
>
> --
> 111 45 Sunset Boulevard 555 236 technical st. 0.1xxxxxx
> 0.xxxxx
>
>
> 111 45 Sunset Boulevard 999 448 Madera 0.9281157
> 0.9875
>
> It should do ONLY do comparison one time and give only the following:-
>
> 111 45 Sunset Boulevard 111 45 Sunset BLVD 0.7453704 0.9875
>
>
>
>
>
>
> Since account numbers are assumed to be clean and same in both files,if
> we don't include the account number match similarity and confidence and
> have the confidence and similarity for the address alone that would be
> perfect.
>
> So the idea is basically to compare the two addresses .Account number
> only helps to link the input file address with the reference file
> address.Instead of account number,if we can compare the address
> sequentially (like first record of input file to be compared with first
> record of reference only) then that is fine too.
>
> This way,each address of FILEA need NOT be compared with each address of
> reference file which will be huge system consuming and is also not
> required.
>
>
> Thanks for your patience and please do feel free to make me clarify
> further.|||Thank you much. Okay.let me put it in a slightly different way to address the issue,since assuming SSIS internally does the different combinations but displays only one ,would mean still again half million look ups for each of the records in FILEA.
Let us have both Mailing address and property address in one file as below.
This will be the only input data file .
Acct
Mail Address
Prop Address
111
45 Sunset blvd
45 Sunst Boulvrd
555
236 technical str
999
448
448 Made a
Mail Address Prop Address Similarity Confidence 45 Sunset blvd 45 Sunst Boulvrd 0.xxxxxxx o.xxxxxxxx 236 technical str 0.xxxxxxxx 0.xxxxxxxx 448 448 Made a 0.xxxxxxx 0.xxxxxxx
as below.
Just one file that has both the reference field and look up field.
No comments:
Post a Comment