Tuesday, March 28, 2023
Use fuzzy string matching to approximate duplicate records in Amazon Redshift


Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. Amazon Redshift enables you to run complex SQL analytics at scale and performance on terabytes to petabytes of structured and unstructured data, and make the insights widely available through popular business intelligence (BI) and analytics tools.

It's common to ingest multiple data sources into Amazon Redshift to perform analytics. Often, each data source will have its own processes of creating and maintaining data, which can lead to data quality challenges within and across sources.

One challenge you may face when performing analytics is the presence of imperfect duplicate records within the source data. Answering questions as simple as "How many unique customers do we have?" can be very challenging when the data you have available is like the following table.

Name Address Date of Birth
Cody Johnson 8 Jeffery Brace, St. Lisatown 1/3/1956
Cody Jonson 8 Jeffery Brace, St. Lisatown 1/3/1956

Although humans can identify that Cody Johnson and Cody Jonson are most likely the same person, it can be difficult to distinguish this using analytics tools. This identification of duplicate records also becomes nearly impossible when working on large datasets across multiple sources.

This post presents one possible approach to addressing this challenge in an Amazon Redshift data warehouse. We import an open-source fuzzy matching Python library into Amazon Redshift, create a simple fuzzy matching user-defined function (UDF), and then create a procedure that weights multiple columns in a table to find matches based on user input. This approach allows you to use the created procedure to approximately identify your unique customers, improving the accuracy of your analytics.

This approach doesn't solve for data quality issues in source systems, and doesn't remove the need for a holistic data quality strategy. For addressing data quality challenges in Amazon Simple Storage Service (Amazon S3) data lakes and data pipelines, AWS has announced AWS Glue Data Quality (preview). You can also use AWS Glue DataBrew, a visual data preparation tool that makes it easy for data analysts and data scientists to clean and normalize data to prepare it for analytics.

Prerequisites

To complete the steps in this post, you need the following:

The following AWS CloudFormation stack will deploy a new Redshift Serverless endpoint and an S3 bucket for use in this post.

Launch CloudFormation stack (BDB-2063)

All SQL commands shown in this post are available in the following notebook, which can be imported into the Amazon Redshift Query Editor V2.

Overview of the dataset being used

The dataset we use mimics a source that holds customer information. This source has a manual process of inserting and updating customer data, and this has led to multiple instances of non-unique customers being represented with duplicate records.

The following examples show some of the data quality issues in the dataset being used.

In this first example, all three customers are the same person but have slight differences in the spelling of their names.

id name age address_line1 city postcode state
1 Cody Johnson 80 8 Jeffrey Brace St. Lisatown 2636 South Australia
101 Cody Jonson 80 8 Jeffrey Brace St. Lisatown 2636 South Australia
121 Kody Johnson 80 8 Jeffrey Brace St. Lisatown 2636 South Australia

In this next example, the two customers are the same person with slightly different addresses.

id name age address_line1 city postcode state
7 Angela Watson 59 3/752 Bernard Follow Janiceberg 2995 Australian Capital Territory
107 Angela Watson 59 752 Bernard Follow Janiceberg 2995 Australian Capital Territory

In this example, the two customers are different people with the same address. This simulates multiple different customers living at the same address who should still be recognized as different people.

id name age address_line1 city postcode state
6 Michael Hunt 69 8 Santana Rest St. Jessicamouth 2964 Queensland
106 Sarah Hunt 69 8 Santana Rest St. Jessicamouth 2964 Queensland

Load the dataset

First, create a new table in your Redshift Serverless endpoint and copy the test data into it by doing the following:

  1. Open the Query Editor V2 and log in using the admin user name and details defined when the endpoint was created.
  2. Run the following CREATE TABLE statement:
    create table customer (
        id smallint, 
        urid smallint,
        name varchar(100),
        age smallint,
        address_line1 varchar(200),
        city varchar(100),
        postcode smallint,
        state varchar(100)
    )
    ;

    Screenshot of CREATE TABLE statement for customer table being run successfully in Query Editor V2

  3. Run the following COPY command to copy data into the newly created table:
    copy customer (id, name, age, address_line1, city, postcode, state)
    from 's3://redshift-blogs/fuzzy-string-matching/customer_data.csv'
    IAM_ROLE default
    FORMAT csv
    REGION 'us-east-1'
    IGNOREHEADER 1
    ;

  4. Confirm the COPY succeeded and there are 110 records in the table by running the following query:
    select count(*) from customer;

    Screenshot showing the count of records in the customer table is 110. Query is run in Query Editor V2

Fuzzy matching

Fuzzy string matching, more formally known as approximate string matching, is the technique of finding strings that match a pattern approximately rather than exactly. Commonly (and in this solution), the Levenshtein distance is used to measure the distance between two strings, and therefore their similarity. The smaller the Levenshtein distance between two strings, the more similar they are.

In this solution, we exploit this property of the Levenshtein distance to estimate whether two customers are the same person based on multiple attributes of the customer, and the approach can be expanded to suit many different use cases.

This solution uses TheFuzz, an open-source Python library that implements the Levenshtein distance in a few different ways. We use the partial_ratio function to compare two strings and provide a result between 1–100. If one of the strings matches perfectly with a portion of the other, the partial_ratio function will return 100.
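As a quick illustration of how partial_ratio behaves, the following minimal local sketch (run outside Amazon Redshift, and assuming TheFuzz has been installed with pip install thefuzz) scores a few strings from the sample dataset; exact scores may vary slightly between library versions.

from thefuzz import fuzz

# Slightly different spellings of the same name score highly (~91 for this pair)
print(fuzz.partial_ratio("Cody Johnson", "Cody Jonson"))

# If one string is contained in the other, partial_ratio should return 100
print(fuzz.partial_ratio("3/752 Bernard Follow", "752 Bernard Follow"))

# Different names at the same address score noticeably lower
print(fuzz.partial_ratio("Michael Hunt", "Sarah Hunt"))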

Weighted fuzzy matching

By adding a scaling factor to each of our column fuzzy matches, we can create a weighted fuzzy match for a record. This is especially useful in two scenarios:

  • We have more confidence in some columns of our data than others, and therefore want to prioritize their similarity results.
  • One column is much longer than the others. A single character difference in a long string will have much less impact on the Levenshtein distance than a single character difference in a short string. Therefore, we want to prioritize long string matches over short string matches.

The solution in this post applies weighted fuzzy matching based on user input defined in another table.
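Conceptually, the weighted match is just the sum of each column's partial_ratio multiplied by that column's weight. The following local Python sketch (an illustration only, not part of the Redshift solution; the column names and weights mirror the mapping table defined in the next section) shows the idea:

from thefuzz import fuzz

# Illustrative weights that sum to 1, matching the mapping inserted later in this post
weights = {"name": 0.3, "address_line1": 0.5, "city": 0.1, "postcode": 0.1}

record_a = {"name": "Cody Johnson", "address_line1": "8 Jeffrey Brace", "city": "St. Lisatown", "postcode": "2636"}
record_b = {"name": "Cody Jonson", "address_line1": "8 Jeffrey Brace", "city": "St. Lisatown", "postcode": "2636"}

# Weighted similarity: per-column partial_ratio scaled by the column weight
weighted_score = sum(
    fuzz.partial_ratio(str(record_a[col]), str(record_b[col])) * weight
    for col, weight in weights.items()
)
print(weighted_score)  # close to 100, because only the name differs slightly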

Create a table for weight information

This reference table holds two columns: the table name and the column mapping with weights. The column mapping is held in a SUPER datatype, which allows JSON semistructured data to be inserted and queried directly in Amazon Redshift. For examples of how to query semistructured data in Amazon Redshift, refer to Querying semistructured data.

In this example, we apply the largest weight to the column address_line1 (0.5) and the smallest weight to the city and postcode columns (0.1).

Using the Query Editor V2, create a new table in your Redshift Serverless endpoint and insert a record by doing the following:

  1. Run the following CREATE TABLE statement:
    CREATE TABLE ref_unique_record_weight_map(table_name varchar(100), column_mapping SUPER);

  2. Run the following INSERT statement:
    INSERT INTO ref_unique_record_weight_map VALUES (
        'customer',
        JSON_PARSE('{
        "colmap":[
        {
            "colname": "name",
            "colweight": 0.3
        },
        {
            "colname": "address_line1",
            "colweight": 0.5
        },
        {
            "colname": "city",
            "colweight": 0.1
        },
        {
            "colname": "postcode",
            "colweight": 0.1
        }
        ]
    }')
    );

  3. Confirm the mapping data has been inserted into the table correctly by running the following query:
    select * from ref_unique_record_weight_map;

    Screenshot showing the result of querying the ref_unique_record_weight_map table in Query Editor V2

  4. To confirm all weights for the customer table add up to 1 (100%), run the following query:
    select  cm.table_name, 
            sum(colmap.colweight) as total_column_weight 
    from    ref_unique_record_weight_map cm, cm.column_mapping.colmap colmap 
    where   cm.table_name = 'customer'
    group by cm.table_name;

    Screenshot showing the total weight applied to the customer table is 1.0
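The query in step 4 navigates the SUPER column with PartiQL-style unnesting (ref_unique_record_weight_map cm, cm.column_mapping.colmap colmap). As an optional further illustration of querying the semistructured mapping, a similar query lists each individual column name and its weight:

select  cm.table_name,
        colmap.colname,
        colmap.colweight
from    ref_unique_record_weight_map cm, cm.column_mapping.colmap colmap
where   cm.table_name = 'customer';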

User-defined functions

With Amazon Redshift, you can create custom scalar user-defined functions (UDFs) using a Python program. A Python UDF contains a Python program that runs when the function is called and returns a single value. In addition to using the standard Python functionality, you can import your own custom Python modules, such as the module described earlier (TheFuzz).

In this solution, we create a Python UDF that takes two input values and compares their similarity.
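If you haven't written a Python UDF before, the general shape is a CREATE FUNCTION statement whose body is Python code. The following trivial example (purely illustrative and not part of this solution) returns the upper-cased form of its input:

CREATE OR REPLACE FUNCTION f_uppercase_example(value_in varchar)
RETURNS varchar IMMUTABLE
AS
$$
    # The body is plain Python; the returned value becomes the SQL result
    if value_in is None:
        return None
    return value_in.upper()
$$ LANGUAGE plpythonu;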

Import external Python libraries to Amazon Redshift

Run the following code snippet to import the TheFuzz module into Amazon Redshift as a new library. This makes the library available within Python UDFs in the Redshift Serverless endpoint. Make sure to provide the name of the S3 bucket you created earlier.

CREATE OR REPLACE LIBRARY thefuzz LANGUAGE plpythonu 
FROM 's3://<your-bucket>/thefuzz.zip' 
IAM_ROLE default;
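If you need to build thefuzz.zip yourself rather than use a pre-built archive, one possible approach is sketched below. It assumes you have pip and boto3 available locally, that <your-bucket> is the S3 bucket created earlier, and that the packaged module is pure Python (Amazon Redshift Python UDFs can only import pure-Python libraries, so check the version of TheFuzz you package and its dependencies).

# Sketch: package TheFuzz into a zip and upload it for CREATE LIBRARY
import shutil
import subprocess
import boto3

# Install the package into a local staging directory
subprocess.check_call(["pip", "install", "thefuzz", "--target", "staging"])

# Zip the staging directory so the module sits at the root of the archive
shutil.make_archive("thefuzz", "zip", root_dir="staging")

# Upload the archive to the S3 bucket referenced by the CREATE LIBRARY statement
boto3.client("s3").upload_file("thefuzz.zip", "<your-bucket>", "thefuzz.zip")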

Create a Python user-defined function

Run the following code snippet to create a new Python UDF called unique_record. This UDF will do the following:

  1. Take two input values that can be of any data type, as long as they are the same data type (such as two integers or two varchars).
  2. Import the newly created thefuzz Python library.
  3. Return an integer value comparing the partial ratio between the two input values.
CREATE OR REPLACE FUNCTION unique_record(value_a ANYELEMENT, value_b ANYELEMENT) 
RETURNS INTEGER IMMUTABLE
AS
$$
    from thefuzz import fuzz

    return fuzz.partial_ratio(value_a, value_b)
$$ LANGUAGE plpythonu;

You can test the function by running the following code snippet:

select unique_record('Cody Johnson'::varchar, 'Cody Jonson'::varchar);

The result shows that these two strings have a similarity value of 91%.

Screenshot showing that using the created function on the Cody Johnson/Cody Jonson name example provides a response of 91

Now that the Python UDF has been created, you can test the response with different input values.
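For example, the following queries (illustrative values taken from the sample dataset; exact scores may vary slightly between library versions) compare the address and name variations shown earlier:

-- One address string is contained in the other, so partial_ratio should return 100
select unique_record('3/752 Bernard Follow'::varchar, '752 Bernard Follow'::varchar);

-- Different first names at the same address produce a noticeably lower score
select unique_record('Michael Hunt'::varchar, 'Sarah Hunt'::varchar);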

Alternatively, you can follow the amazon-redshift-udfs GitHub repo to install the f_fuzzy_string_match Python UDF.

Stored procedures

Stored procedures are commonly used to encapsulate logic for data transformation, data validation, and business-specific logic. By combining multiple SQL steps into a stored procedure, you can reduce round trips between your applications and the database.

In this solution, we create a stored procedure that applies weighting to multiple columns. Because this logic is common and repeatable regardless of the source table or data, it allows us to create the stored procedure once and use it for multiple purposes.

Create a stored procedure

Run the following code snippet to create a new Amazon Redshift stored procedure called find_unique_id. This procedure will do the following:

  1. Take one input value. This value is the table you would like to create a golden record for (in our case, the customer table).
  2. Declare a set of variables to be used throughout the procedure.
  3. Check to see if weight data is in the staging table created in previous steps.
  4. Build a query string for comparing each column and applying weights using the weight data inserted in previous steps.
  5. For each record in the input table that doesn't have a unique record ID (URID) yet, it will do the following:
    1. Create a temporary table to stage results. This temporary table will have all potential URIDs from the input table.
    2. Allocate a similarity value to each URID. This value specifies how similar this URID is to the record in question, weighted with the inputs defined.
    3. Choose the closest matched URID, but only if there is a >90% match.
    4. If there is no URID match, create a new URID.
    5. Update the source table with the new URID and move to the next record.

This procedure will only ever look for new URIDs for records that don't already have one allocated. Therefore, rerunning the URID procedure multiple times will have no impact on the results.

CREATE OR REPLACE PROCEDURE find_unique_id(table_name varchar(100)) AS $$
DECLARE
    unique_record RECORD;
    column_info RECORD;

    column_fuzzy_comparison_string varchar(MAX) := '0.0';
    max_similarity_value decimal(5,2) := 0.0;

    table_check varchar(100);
    temp_column_name varchar(100);
    temp_column_weight decimal(5,2);
    unique_record_id smallint := 0;
BEGIN
    /* 
        Check the ref_unique_record_weight_map table to see if there is a mapping record for the provided table.
        If there is no mapping record, raise an exception
    */
    SELECT INTO table_check cm.table_name from ref_unique_record_weight_map cm where cm.table_name = quote_ident(table_name);
    IF NOT FOUND THEN
        RAISE EXCEPTION 'Input table ''%'' not found in mapping object', table_name;
        RETURN;
    END IF;

    /*
        Build the query to be used to compare each column using the mapping record in the ref_unique_record_weight_map table.
        For each column specified in the mapping object, append a weighted comparison of the column
    */
    FOR column_info IN (
        select  colmap.colname::varchar(100) column_name, 
                colmap.colweight column_weight 
        from    ref_unique_record_weight_map cm, cm.column_mapping.colmap colmap 
        where   cm.table_name = quote_ident(table_name)
    ) LOOP
        temp_column_name = column_info.column_name;
        temp_column_weight = column_info.column_weight;
        
        column_fuzzy_comparison_string = column_fuzzy_comparison_string || 
            ' + unique_record(t1.' || 
            temp_column_name || 
            '::varchar, t2.' || 
            temp_column_name || 
            '::varchar)*' || 
            temp_column_weight;
    END LOOP;

    /* Drop temporary table if it exists */
    EXECUTE 'DROP TABLE IF EXISTS #unique_record_table';

    /*
        For each record in the source table that doesn't have a Unique Record ID (URID):
            1. Create a new temporary table holding all potential URIDs for this record (i.e. all URIDs that are currently present). 
                Note: This temporary table will only be present while the similarity check is being calculated
            2. Update each potential URID in the temporary table with its similarity to the record being checked
            3. Find the most similar record with a URID
                3a. If the most similar record is at least 90% similar, take its URID (i.e. this is not a unique record, and matches another in the table)
                3b. If there is no record that is 90% similar, create a new URID (i.e. this is a unique record)
            4. Drop the temporary table in preparation for the next record
    */
    FOR unique_record in EXECUTE 'select * from ' || table_name || ' where urid is null order by id asc' LOOP

        RAISE INFO 'Processing record with id %', unique_record.id;

        /* Create temporary table */
        EXECUTE '
            CREATE TABLE #unique_record_table AS 
            SELECT id, urid, 0.0::decimal(5,2) as similarity_value 
            FROM ' || table_name || '
            where urid is not null
            ';

        /* Update similarity values in temporary table */
        EXECUTE '
            UPDATE #unique_record_table  
            SET similarity_value = round(calc_similarity_value,2)::decimal(5,2)
            FROM (
                SELECT ' || column_fuzzy_comparison_string || ' as calc_similarity_value,
                        t2.id as upd_id
                FROM ' || table_name || ' t1
                INNER JOIN ' || table_name || ' t2
                ON t1.id <> t2.id
                AND t1.id = ' || quote_literal(unique_record.id) || '
                ) t
            WHERE t.upd_id = id
            ';

        /* Find the largest similarity value */
        SELECT INTO max_similarity_value similarity_value FROM (
            SELECT  MAX(similarity_value) as similarity_value 
            FROM    #unique_record_table
        );

        /* If there is a >90% similar match, choose its URID. Otherwise, create a new URID */
        IF max_similarity_value > 90 THEN
            SELECT INTO unique_record_id urid FROM (
                SELECT urid
                FROM #unique_record_table
                WHERE similarity_value = max_similarity_value
            );
        ELSE 
            EXECUTE 'select COALESCE(MAX(urid)+1,1) FROM ' || table_name INTO unique_record_id;
        END IF;
        
        /* Update the table with the new URID value */
        EXECUTE 'UPDATE ' || table_name || ' SET urid = ' || quote_literal(unique_record_id) || ' WHERE id = ' || quote_literal(unique_record.id);

        /* Drop temporary table and repeat the process */
        EXECUTE 'DROP TABLE #unique_record_table';

        max_similarity_value = 0.0;
    END LOOP;

END;
$$ LANGUAGE plpgsql;

Now that the stored procedure has been created, create the unique record IDs for the customer table by running the following in the Query Editor V2. This will update the urid column of the customer table.

CALL find_unique_id('customer'); 
select * from customer;

Screenshot showing the customer table now has values inserted in the URID column
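Because the procedure only assigns URIDs to records that don't yet have one, an optional sanity check after the call is that no records remain unassigned (the following query should return 0):

select count(*) as records_missing_urid
from customer
where urid is null;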

When the procedure has completed its run, you can identify which duplicate customers were given unique IDs by running the following query:

select * 
from customer
where urid in (
    select urid 
    from customer 
    group by urid 
    having count(*) > 1
    )
order by urid asc
;

Screenshot showing the records that have been identified as duplicate records

From this you can see that IDs 1, 101, and 121 have all been given the same URID, as have IDs 7 and 107.

Screenshot showing the result for IDs 1, 101, and 121

Screenshot showing the result for IDs 7, and 107

The procedure has also correctly identified that IDs 6 and 106 are different customers, and they therefore don't have the same URID.

Screenshot showing the result for IDs 6, and 106
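With URIDs assigned, the question posed at the start of this post ("How many unique customers do we have?") can now be answered approximately by counting distinct URIDs rather than raw rows, for example:

select  count(distinct urid) as approx_unique_customers,
        count(*) as total_records
from customer;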

Clean up

To avoid incurring future recurring charges, delete all files in the S3 bucket you created. After you delete the files, go to the AWS CloudFormation console and delete the stack deployed in this post. This will delete all created resources.

Conclusion

In this post, we showed one approach to identifying imperfect duplicate records by applying a fuzzy matching algorithm in Amazon Redshift. This solution allows you to identify data quality issues and apply more accurate analytics to your dataset residing in Amazon Redshift.

We showed how you can use open-source Python libraries to create Python UDFs, and how to create a generic stored procedure to identify imperfect matches. This solution is extendable to provide any functionality required, including adding it as a regular process in your ELT (extract, load, and transform) workloads.

Test the created procedure on your datasets to investigate the presence of any imperfect duplicates, and use the learnings from this post to create stored procedures and UDFs to implement further functionality.

If you're new to Amazon Redshift, refer to Getting started with Amazon Redshift for more information and tutorials. You can also refer to the video Get started with Amazon Redshift Serverless for information on getting started with Redshift Serverless.


About the Author

Sean Beath is an Analytics Solutions Architect at Amazon Web Services. He has experience in the full delivery lifecycle of data platform modernisation using AWS services and works with customers to help drive analytics value on AWS.
