Finding near-duplicates with Jaccard similarity and MinHash

(blog.nelhage.com)

112 points | by brianyu8 3 days ago

5 comments

  • tpoacher 3 days ago
    One thing many people miss about set-based metrics like the Jaccard similarity (aka Tanimoto coefficient) and F1 score (aka Dice coefficient) is that they can also be used identically with fuzzy sets.

    The only complication is that you then need to choose a suitable T-norm / T-conorm pair, which express the concepts of intersection and union for fuzzy sets, and there's an infinite family of them. But that's a good thing, since you get to pick the pair with your desired semantics.

    I wrote about this ([0][1]) in the context of validating medical image segmentations when both the segmentation and ground truth are probabilistic/fuzzy rather than binary masks.

    What most people do instead is simply threshold at 0.5 to obtain binary sets for use with the binary variants of the Jaccard / Dice coefficients, which apparently decreases the precision of your validation operator by two orders of magnitude. It's like, you publish your algorithm claiming it's better than SOTA by 0.001, ignoring the fact that your validation operator has an error margin of 0.1 ...
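
    For a concrete sense of what the fuzzy variant looks like, here is a minimal sketch using the min/max (Gödel) T-norm / T-conorm pair; the membership values are made up for illustration:

        import numpy as np

        # Fuzzy Jaccard / Dice with the min/max (Goedel) T-norm / T-conorm pair.
        # Memberships are in [0, 1]; hard 0/1 masks are just a special case.
        def fuzzy_jaccard(a, b):
            intersection = np.minimum(a, b).sum()  # T-norm: min
            union = np.maximum(a, b).sum()         # T-conorm: max
            return intersection / union if union > 0 else 1.0

        def fuzzy_dice(a, b):
            intersection = np.minimum(a, b).sum()
            return 2 * intersection / (a.sum() + b.sum())

        # Toy probabilistic segmentation vs. ground truth (values are illustrative)
        seg = np.array([0.9, 0.6, 0.2, 0.0])
        gt = np.array([1.0, 0.4, 0.3, 0.0])
        print(fuzzy_jaccard(seg, gt), fuzzy_dice(seg, gt))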

    [0] https://link.springer.com/chapter/10.1007/978-3-319-46723-8_...

    [1] https://ora.ox.ac.uk/objects/uuid:dc352697-c804-4257-8aec-08...

  • BiteCode_dev 3 days ago
    I worked with a client that implemented their own Python version of this to deduplicate citizen entries in a big French government database. It worked well.

    Of course, nowadays I would probably just tell them to use datasketch (https://pypi.org/project/datasketch/).

    With this trip down memory lane, I looked around a little and noticed people are still creating new stuff on the topic, e.g.:

    https://pypi.org/project/rensa/

    Which is basically a more specialized but faster version of datasketch's MinHash, written in Rust with a little Python on top.
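
    For anyone curious, this is roughly what deduplication with datasketch's MinHash + MinHashLSH looks like (a minimal sketch; the tokenisation and the 0.8 threshold are arbitrary choices here):

        from datasketch import MinHash, MinHashLSH

        def minhash(text, num_perm=128):
            m = MinHash(num_perm=num_perm)
            for token in text.lower().split():
                m.update(token.encode("utf8"))
            return m

        lsh = MinHashLSH(threshold=0.8, num_perm=128)
        docs = {
            "a": "the quick brown fox jumps over the lazy dog",
            "b": "the quick brown fox jumped over the lazy dog",
            "c": "an entirely different sentence",
        }
        for key, text in docs.items():
            lsh.insert(key, minhash(text))

        # Candidate near-duplicates of document "a"
        print(lsh.query(minhash(docs["a"])))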

    • RobinL 3 days ago
      For deduplicating people, the Fellegi-Sunter model is also a powerful approach. Splink [0] is a free Python library that implements this for big datasets. You could probably combine parts of both approaches as well. Full disclosure: I am the lead author.

      I've also written up some interactive tutorials on how the method works [1] if anyone's interested.

      [0] https://github.com/moj-analytical-services/splink [1] https://www.robinlinacre.com/intro_to_probabilistic_linkage/

      • BiteCode_dev 3 days ago
        Interesting.

        What would you say are the main differences with other approaches?

        • RobinL 3 days ago
          The Fellegi-Sunter model is able to estimate the importance of different types of information from the data itself (i.e. unsupervised learning).

          For instance, a match on a date of birth column lends a greater weight of evidence in favour of a match than a match on first name (since dob has higher cardinality).

          The method is also able to estimate weights for fuzzy matches (how much evidence in favour of a match is a close match on dob with one character difference), and also how much evidence a mismatch counts against a match.

          For instance, if you have very high data quality on gender, then a match on gender doesn't tell you much, but a mismatch on gender is quite strong evidence against the idea two records match.

          I have a blog post here that delves into this a bit more: https://www.robinlinacre.com/fellegi_sunter_accuracy/
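
          To make the match-weight idea concrete, here is a toy illustration (plain Python, not Splink's API; the m/u probabilities are invented): each comparison contributes log2(m/u) if the fields agree and log2((1-m)/(1-u)) if they don't, where m = P(agree | same person) and u = P(agree | different people).

              import math

              # Toy Fellegi-Sunter scoring with made-up m/u probabilities.
              comparisons = {
                  #               m      u
                  "dob":        (0.95, 0.001),  # high cardinality: agreement is strong evidence
                  "first_name": (0.90, 0.05),
                  "gender":     (0.99, 0.50),   # agreement is weak evidence, disagreement is strong
              }

              def match_weight(agree, m, u):
                  # log2 Bayes factor: positive if the observation favours a match
                  return math.log2(m / u) if agree else math.log2((1 - m) / (1 - u))

              record_pair = {"dob": True, "first_name": True, "gender": False}
              score = sum(match_weight(record_pair[f], m, u) for f, (m, u) in comparisons.items())
              print(f"total match weight: {score:.2f}")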

          • BiteCode_dev 3 days ago
            Ok, so you get better accuracy if your datasets have obviously weighted fields. Do you pay for that in performance, and if so, how much?
            • RobinL 3 days ago
              The performance is pretty good because the prediction is ultimately just adding up match weights. Much of the performance is dictated by

              (1) The blocking approach you choose (how wide you cast the net in searching for matches; this is actually somewhere where MinHash can be used in conjunction with this approach).

              (2) Whether you choose to use complex fuzzy matching functions and how many - this is chosen by the user.

              There's some benchmarking results here: https://www.robinlinacre.com/fast_deduplication/

              Overall it's an approach which makes a pretty good trade-off between speed and accuracy. That's why it's used by many national stats institutes (UK, US, Australia, Germany, etc.) - because it's capable of working on country-population-sized datasets.
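
              To illustrate what blocking means in practice (a hypothetical example, not Splink code): rather than scoring every pair, you only score pairs that share a cheap blocking key.

                  from collections import defaultdict
                  from itertools import combinations

                  # Hypothetical records; blocking key = (first letter of surname, birth year)
                  records = [
                      {"id": 1, "surname": "Smith", "dob": "1980-01-02"},
                      {"id": 2, "surname": "Smyth", "dob": "1980-01-02"},
                      {"id": 3, "surname": "Jones", "dob": "1975-06-30"},
                  ]

                  blocks = defaultdict(list)
                  for r in records:
                      blocks[(r["surname"][0], r["dob"][:4])].append(r)

                  # Only pairs within the same block get scored by the (expensive) model
                  candidate_pairs = [pair for block in blocks.values() for pair in combinations(block, 2)]
                  print([(a["id"], b["id"]) for a, b in candidate_pairs])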

    • molodec 3 days ago
      There is also gaoya (I am the author), which is also written in Rust and has Python bindings. datasketch is great, but it is not performant enough for my use case. gaoya is used in a production system for large-scale clustering: https://github.com/serega/gaoya
  • pkeenan11 3 days ago
    Holy synchronicity, Batman! I just implemented a minhash system that some may find interesting.

    The problem is to find (pseudo) inverses of many different proper submatrices of a big square matrix.

    Certain matrix identities (Woodbury, Banachiewicz) allow you to update the inverse of a “close” submatrix to cheaply calculate a new one.

    So you store the inverses you’ve calculated already, with their row/column indices as keys. Then for each new submatrix you want to find a close already-computed inverse to update from.

    This is solved with minhashing. You minwise hash the indices so that close matrices are likely to have the same hash.

    In my particular implementation I did a “multi-resolution” hash so that I can change the selectiveness of the search as the number of prior computed inverses grows.
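
    Presumably something along these lines (a minimal sketch of the idea, not the actual implementation): minwise-hash the row/column index set of each stored inverse, so submatrices with large index overlap tend to land in the same bucket.

        import random

        random.seed(0)
        P = 2_147_483_647  # large prime for the hash family

        # Small family of random hash functions h(x) = (a*x + b) mod P
        HASHES = [(random.randrange(1, P), random.randrange(0, P)) for _ in range(8)]

        def minhash_signature(index_set):
            # One minimum per hash function; index sets with high Jaccard overlap
            # are likely to agree on many (or all) signature positions.
            return tuple(min((a * i + b) % P for i in index_set) for a, b in HASHES)

        cache = {}  # signature -> (index set, precomputed inverse)

        def store(indices, inverse):
            cache[minhash_signature(indices)] = (indices, inverse)

        def lookup_close(indices):
            # Returns a stored inverse whose index set collides with this one, if any
            return cache.get(minhash_signature(indices))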

  • vivzkestrel 3 days ago
    I have this problem right now in Postgres. I have 600,000 feed_items with the schema (feed_item_id uuid, author varchar, content text, guid varchar, link varchar, title varchar, summary text, feed_id integer). The content and summary columns, especially for some of the news items, are highly similar but not equal. For any given two such news items, I am trying to cut them down to one. Any ideas?
    • dleeftink 3 days ago
      I've implemented a minhash-like system in BigQuery, and was able to calculate cosine similarities (within a certain similarity threshold) between all Stack Overflow posts in reasonable time. To give you a broad idea of the procedure:

      1. Concatenate and split all text fields into an array of ngrams (2...n chars)

      2. Declare two global arrays A & B of length-n and fill them with random 32-64 bit integers

      3. Hash each ngram to a 32-64 bit integer, multiply the resulting hash by each random value in array A, and modulo the resulting output with each random value in array B. Take the minimum value. Per row, your aim is to end up with an array of 'minhashed' integers of equal length to the arrays in step 2. If you declare the global arrays to be length of 64, the minhashed array for each row will also be of length 64.

      4. Bucketise the resulting hash arrays by summing N consecutive minhash values using a window function (e.g. sum each 4 consecutive rows)

      If all went well, you can now unnest these arrays (call them 'source rows') and self-join the dataset on each bucketed minhash value (resulting in an additional column of 'target rows'). Group by the source and target columns and count the occurrences to get an indication of how likely it is that two rows are similar.

      In essence, the more often two items hash to a similar bucket, the more similar they are, and it's up to you to define the cut-off from where to calculate actual pairwise Jaccard (or cosine) similarities between items.
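
      A rough Python transcription of steps 1-3, in case it helps (parameter choices are arbitrary; the bucketing and self-join of step 4 would stay in SQL):

          import hashlib
          import random

          random.seed(42)
          NUM_PERM = 64  # length of the global arrays, and of each row's signature
          A = [random.randrange(1 << 31, 1 << 32) for _ in range(NUM_PERM)]  # random multipliers
          B = [random.randrange(1 << 31, 1 << 32) for _ in range(NUM_PERM)]  # random moduli

          def ngrams(text, n=3):
              text = " ".join(text.lower().split())
              return {text[i:i + n] for i in range(len(text) - n + 1)}

          def hash_ngram(g):
              return int.from_bytes(hashlib.md5(g.encode("utf8")).digest()[:4], "big")

          def minhash_row(*fields, n=3):
              hashes = [hash_ngram(g) for g in ngrams(" ".join(fields), n)]
              # Step 3: multiply by each value in A, modulo each value in B, take the min
              return [min((h * a) % b for h in hashes) for a, b in zip(A, B)]

          sig = minhash_row("some title", "some highly similar summary text")
          print(len(sig), sig[:4])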

    • RobinL 3 days ago
      One useful technique here could be to use text embeddings and cosine similarity: https://simonwillison.net/2023/Oct/23/embeddings/
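
      For example, with sentence-transformers as one possible embedding model (the model name and texts are placeholders):

          import numpy as np
          from sentence_transformers import SentenceTransformer

          # Any sentence-embedding model would do; this one is small and widely used
          model = SentenceTransformer("all-MiniLM-L6-v2")

          texts = [
              "Stocks rallied after the central bank held rates steady.",
              "Shares rose as the central bank kept interest rates unchanged.",
              "New species of frog discovered in the Amazon rainforest.",
          ]
          emb = model.encode(texts, normalize_embeddings=True)  # unit-length vectors

          # With unit-length vectors, cosine similarity is just the dot product
          print(np.round(emb @ emb.T, 2))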
      • swasheck 3 days ago
        love this and have been using tf/idf for embeddings and various measures of similarity for some personal pet projects. one thing i came across in my research is that cosine similarity was more useful for vectors of different lengths and that euclidean distance was useful for vectors of similar length but simon alludes to a same-length requirement. i’m not formally trained in this area so i was hoping someone could shed some light on this for me.
    • probably_wrong 3 days ago
      If you think two items cover the same highly similar _keywords_ then Jaccard distance should work.

      If you think two items share highly similar _text_ then give Levenshtein distance a try.
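
      A quick way to try both on a pair of items (pure Python; any thresholds would need tuning on your data):

          def jaccard(a, b):
              # Token-set similarity: high when the same keywords appear in both items
              sa, sb = set(a.lower().split()), set(b.lower().split())
              return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

          def levenshtein(a, b):
              # Edit distance: low when the raw text is nearly identical
              prev = list(range(len(b) + 1))
              for i, ca in enumerate(a, 1):
                  cur = [i]
                  for j, cb in enumerate(b, 1):
                      cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
                  prev = cur
              return prev[-1]

          print(jaccard("markets fall after rate decision", "rate decision sees markets fall"))
          print(levenshtein("markets fall sharply", "markets fell sharply"))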

  • is_true 3 days ago
    I had to do this with sports teams, but I used Levenshtein for the names. I ended up creating a vector with other data (country, gender) and using that vector to calculate the distance (similarity).