Wednesday, March 7, 2012

I need to compare two string values and give confidence score on how much they are similar.How d


I need to compare two strings and get a confidence score on how similar they are. I have around half a million of such strings to compare.While Fuzzy look up sample codes I used compare a string against all the records in the reference table,my comparison requirement is limited to two given strings.
Can some body please help me on how to do this using fuzzy look up?
Thank you very much!Do you want to compare all the source strings with one string in the

lookup table? If that is the case then can you not use a view of the

one reference column as the lookup and not the table object? If I have

misinterpreted the requirement then please let me know.

Allan

"Ashraj@.discussions.microsoft.com"

wrote in message

news:046a46c8-8fa8-4399-9417-24493b01d9d0@.discussions.microsoft.com:

> I need to compare two strings and get a confidence score on how similar

> they are. I have around half a million of such strings to compare.While

> Fuzzy look up sample codes I used compare a string against all the

> records in the reference table,my comparison requirement is limited to

> two given strings.

>

>

> Can some body please help me on how to do this using fuzzy look up?

> Thank you very much!|||Hi,
Thanks much for your input.Sorry I wasn't clear.

Actually, I have two files . FILEA and FILEB. They have have similar structure with two fields Accountnumber(unique key) and Address(String value) . FileB is the reference file.FileA is input file.Both have same number of records with matching key fields.

For each account number in FILEA,the Address string has to be compared to address string in FILEB with the same account number.

For example Address in first row "45 Sunset blvd" with acct number 111 ,needs to be compared with address of account 111 in FILEB which is "456 Sunset Blvd " and a confidance and similarity score needs to be registered .

The Key field account number is only for linking purpose which may not be required if I have a way to pass two strings and get the comparison results.

FileA

Acct

Address

111

45 Sunset blvd

555

236 technical str

999

448 madera

FileB

Acct

Address

111

456 Sunset Blvd

555

236 Technical st.

999

449 Madera

Looking forward to any help that can be provided.Please feel free to ask for any clarifications.Thanks much|||Ok So I have this

CREATE TABLE AccountsInput(AcctNUm int, Street varchar(50))

CREATE TABLE AccountsLookup(AcctNUm int, Street varchar(50))

Insert AccountsInput VALUES(111,'45 Sunset Boulevard')

Insert AccountsInput VALUES(555,'236 technical str')

Insert AccountsInput VALUES(999,'448 Madeira')

Insert AccountsLookup VALUES(111,'45 Sunset BLVD')

Insert AccountsLookup VALUES(555,'236 technical st.')

Insert AccountsLookup VALUES(999,'448 Madera')

I want to compare both the acct# and the address to get matches or not.

I configure the fuzzy lookup to match on both the attributes and I get

this.

AcctNUmStreetLooked Up account NumLooked Up Street

Name_Similarity_Confidence

11145 Sunset Boulevard11145 Sunset BLVD0.74537040.9875

555236 technical str555236 technical st.0.88491890.9875

999448 Madeira999448 Madera0.92811570.9875

Is this not what you want ?

Allan

"Ashraj@.discussions.microsoft.com"

wrote in message

news:74b34ed7-c2df-4569-b8bd-1d5e4ed19b4f@.discussions.microsoft.com:

> Hi,

> Thanks much for your input.Sorry I wasn't clear.

>

> Actually, I have two files . FILEA and FILEB. They have have similar

> structure with two fields Accountnumber(unique key) and Address(String

> value) . FileB is the reference file.FileA is input file.Both have same

> number of records with matching key fields.

>

> For each account number in FILEA,the Address string has to be compared

> to address string in FILEB with the same account number.

>

>

> For example Address in first row "45 Sunset blvd" with acct number 111

> ,needs to be compared with address of account 111 in FILEB which is

> "456 Sunset Blvd " and a confidance and similarity score needs to be

> registered .

>

> The Key field account number is only for linking purpose which may not

> be required if I have a way to pass two strings and get the comparison

> results.

>

>

>

> FileA

>

>

>

> Acct

>

> Address

>

> 111

>

> 45 Sunset blvd

>

> 555

>

> 236 technical str

>

> 999

>

> 448 madera

>

>

>

>

>

> FileB

>

>

>

> Acct

>

> Address

>

> 111

>

> 456 Sunset Blvd

>

> 555

>

> 236 Technical st.

>

> 999

>

> 449 Madera

>

>

>

> Looking forward to any help that can be provided.Please feel free to ask

> for any clarifications.Thanks much|||Thanks Alan. It looks almost perfect .
I just wanted to confirm one more thing before I say this is what I exactly want.

For our comparison, I have the following basic assumption that
a) account numbers are not corrupt and all account numbers in FILEA are in FILEB and viceversa.
b) FILEA and FILEB are unique keyed with account number

As you had detailed, I want address of account number 111 in FILEA to be compared to address of account number 111 ONLY in FILEB.

The address comparison should be made ONLY between the records that have same account number in FILEA and FILEB .Comparison and statistics should not be made between address that don't have common account number.
For example, Address of account number 111 SHOULD NOT be compared with address of account number 555 and 999.

In our example for first record it SHOULD NOT execute 3 searches and give the following.
111 45 Sunset Boulevard 111 45 Sunset BLVD 0.7453704 0.9875
--
111 45 Sunset Boulevard 555 236 technical st. 0.1xxxxxx 0.xxxxx

111 45 Sunset Boulevard 999 448 Madera 0.9281157 0.9875

It should do ONLY do comparison one time and give only the following:-

111 45 Sunset Boulevard 111 45 Sunset BLVD 0.7453704 0.9875

Since account numbers are assumed to be clean and same in both files,if we don't include the account number match similarity and confidence and have the confidence and similarity for the address alone that would be perfect.

So the idea is basically to compare the two addresses .Account number only helps to link the input file address with the reference file address.Instead of account number,if we can compare the address sequentially (like first record of input file to be compared with first record of reference only) then that is fine too.

This way,each address of FILEA need NOT be compared with each address of reference file which will be huge system consuming and is also not required.
Thanks for your patience and please do feel free to make me clarify further.

|||The CROSS PRODUCT matching you describe will not occur or at least it

will not be visible to you I think. Thinking about this I do not know

if the engine would consider matching (It is fuzzy after all) but

because of the so way out probability and confidence scores it would not

get further than a distant thought.

To that end I changed the input data slightly

CREATE TABLE AccountsInput(AcctNUm int, Street varchar(50))

CREATE TABLE AccountsLookup(AcctNUm int, Street varchar(50))

Insert AccountsInput VALUES(111,'45 Sunset Boulevard')

Insert AccountsInput VALUES(555,'236 technical str')

Insert AccountsInput VALUES(999,'448 Madeira')

Insert AccountsLookup VALUES(111,'45 Sunset BLVD')

Insert AccountsLookup VALUES(555,'236 technical st.')

Insert AccountsLookup VALUES(999,'448 Madera')

Insert AccountsLookup VALUES(996,'442 Madeira')--added very close to 999

I still only matched on the same rows.

Allam

"Ashraj@.discussions.microsoft.com"

wrote in message

news:5b707de0-1b00-40d3-9a4b-4868791a245f@.discussions.microsoft.com:

> Thanks Alan. It looks almost perfect .

> I just wanted to confirm one more thing before I say this is what I

> exactly want.

>

> For our comparison, I have the following basic assumption that

> a) account numbers are not corrupt and all account numbers in FILEA are

> in FILEB and viceversa.

> b) FILEA and FILEB are unique keyed with account number

>

> As you had detailed, I want address of account number 111 in FILEA to

> be compared to address of account number 111 ONLY in FILEB.

>

> The address comparison should be made ONLY between the records that have

> same account number in FILEA and FILEB .Comparison and statistics should

> not be made between address that don't have common account number.

>

>

> For example, Address of account number 111 SHOULD NOT be compared with

> address of account number 555 and 999.

>

> In our example for first record it SHOULD NOT execute 3 searches and

> give the following.

> 111 45 Sunset Boulevard 111 45 Sunset BLVD 0.7453704

> 0.9875

>

> --

> 111 45 Sunset Boulevard 555 236 technical st. 0.1xxxxxx

> 0.xxxxx

>

>

> 111 45 Sunset Boulevard 999 448 Madera 0.9281157

> 0.9875

>

> It should do ONLY do comparison one time and give only the following:-

>

> 111 45 Sunset Boulevard 111 45 Sunset BLVD 0.7453704 0.9875

>

>

>

>

>

>

> Since account numbers are assumed to be clean and same in both files,if

> we don't include the account number match similarity and confidence and

> have the confidence and similarity for the address alone that would be

> perfect.

>

> So the idea is basically to compare the two addresses .Account number

> only helps to link the input file address with the reference file

> address.Instead of account number,if we can compare the address

> sequentially (like first record of input file to be compared with first

> record of reference only) then that is fine too.

>

> This way,each address of FILEA need NOT be compared with each address of

> reference file which will be huge system consuming and is also not

> required.

>

>

> Thanks for your patience and please do feel free to make me clarify

> further.|||Thank you much. Okay.let me put it in a slightly different way to address the issue,since assuming SSIS internally does the different combinations but displays only one ,would mean still again half million look ups for each of the records in FILEA.

Let us have both Mailing address and property address in one file as below.
This will be the only input data file .

Acct

Mail Address

Prop Address

111

45 Sunset blvd

45 Sunst Boulvrd

555

236 technical str

23 Technic street

999

448 madera

448 Made a

We want to do is Mailing address and prop address strings for each row and give the output as ACCT#,Mail Address,Prop Address ,Similarity,Confidance
as below.

Mail Address

Prop Address

Similarity

Confidence

45 Sunset blvd

45 Sunst Boulvrd

0.xxxxxxx

o.xxxxxxxx

236 technical str

23 Technic street

0.xxxxxxxx

0.xxxxxxxx

448 madera

448 Made a

0.xxxxxxx

0.xxxxxxx

Just one file that has both the reference field and look up field.

No comments:

Post a Comment