Python Hash Sets Explained & Demonstrated - Computerphile

Published 2024-02-01
Featuring Mike Pound. Jane Street skyscraper puzzle (and info on the AMP program) at bit.ly/computerphile-amp --- More below ↓↓↓

Hash sets in Python work a little bit like the index of a book, giving you a shortcut to finding a value instead of scanning a whole list. Dr Mike Pound explains how they work and demos with some code.
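
A minimal sketch of the idea (illustrative code, not the exact class Mike builds in the video): hash() picks a bucket, and each bucket is a small list that absorbs collisions.

    class SimpleHashSet:
        """A toy hash set: hash(item) % num_buckets selects a bucket,
        and each bucket is a plain list that holds colliding items."""

        def __init__(self, num_buckets=8):
            self._buckets = [[] for _ in range(num_buckets)]

        def _index(self, item):
            # Reduce Python's built-in hash to a valid bucket index.
            return hash(item) % len(self._buckets)

        def add(self, item):
            bucket = self._buckets[self._index(item)]
            if item not in bucket:  # linear scan, but only within one bucket
                bucket.append(item)

        def __contains__(self, item):
            return item in self._buckets[self._index(item)]

    s = SimpleHashSet()
    s.add("apple")
    s.add("banana")
    print("apple" in s)   # True
    print("cherry" in s)  # False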

#Python #HashSet #Code #Computerphile

Jane Street’s Academy of Math and Programming is now accepting applications for their summer 2024 program, which will run from June 29th-August 2nd in NYC... Or you can just check out the puzzle for fun too - bit.ly/computerphile-amp (episode sponsor)

www.facebook.com/computerphile
twitter.com/computer_phile

This video was filmed and edited by Sean Riley.

Computer Science at the University of Nottingham: bit.ly/nottscomputer

Computerphile is a sister project to Brady Haran's Numberphile. More at www.bradyharanblog.com/

Thank you to Jane Street for their support of this channel. Learn more: www.janestreet.com/

All Comments (21)
  • @Davourflave
    "O(1) means that there is no relationship between the speed of the lookup and the number of elements in your collection". Couldn't have said it better, as often with big O, the devil is in the constant you dropped during notation :D
  • @ejc4684
    Always love a Mike Pound video!!!
  • I invented and implemented this very scheme in 1978, on an HP9845A in HP Basic with a 20 MB hard disk, and discovered a few things: 1) Hash collisions are best stored in a sorted list, so that a binary search can be done, reducing the search time dramatically. 2) Hashing integers as themselves is a disaster in the real world, where initial keys of 0 proliferate. (Amongst other common integers, such as -1.)
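
A rough Python sketch of the sorted-bucket idea described above (my illustration, assuming orderable items; the 1978 original was of course HP Basic):

    import bisect

    class SortedBucketHashSet:
        """Collision buckets are kept sorted, so within-bucket
        lookups can use binary search instead of a linear scan."""

        def __init__(self, num_buckets=64):
            self._buckets = [[] for _ in range(num_buckets)]

        def add(self, item):
            bucket = self._buckets[hash(item) % len(self._buckets)]
            i = bisect.bisect_left(bucket, item)
            if i == len(bucket) or bucket[i] != item:
                bucket.insert(i, item)  # insert in place, keeping order

        def __contains__(self, item):
            bucket = self._buckets[hash(item) % len(self._buckets)]
            i = bisect.bisect_left(bucket, item)
            return i < len(bucket) and bucket[i] == item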
  • @Richardincancale
    Implemented exactly the simple form of this in a commercial compiler around 1980 to store the symbol table (the list of all identifiers defined in the program being compiled, their types, sizes, etc.). Chosen for lookup speed, as the symbol table is accessed frequently in the compilation process
  • Side note: if your list of values is static and known in advance, the Gnu “gperf” tool can come up with a “perfect” hash function that gives you the minimum array size with no collisions. It generates C/C++ code, but the output should be portable to most other languages with a small amount of effort.
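
gperf emits C, but the underlying idea, searching for hash parameters that are collision-free on a fixed key set, can be sketched in Python (a toy illustration of the concept, not gperf's actual algorithm; note that Python's hash is randomized per run):

    def find_perfect_params(keys, salt_limit=1000):
        """Brute-force a (salt, size) pair for which every key lands
        in a distinct slot -- a 'perfect' hash for this key set."""
        for size in range(len(keys), 4 * len(keys) + 1):  # smallest table first
            for salt in range(salt_limit):
                slots = {hash((salt, k)) % size for k in keys}
                if len(slots) == len(keys):  # zero collisions
                    return salt, size
        raise ValueError("no perfect parameters found in search range")

    keywords = ["if", "else", "while", "for", "return"]
    salt, size = find_perfect_params(keywords)
    print(f"salt={salt}, table size={size}: no collisions for {len(keywords)} keys")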
  • @bensmith9253
    I saw an interview question video yesterday about these - really good timing for me this video. 😁😁😁
  • @gustafsundberg
    I implemented this in Ada back in 1997 during a class in computer science. I think 73 was a pretty good prime to use while hashing to minimize collisions.
  • @JaapVersteegh
    Not really interested in the topic, because I already know this, but still watched because Mike's presentations are always engaging
  • @prkhrsrvstv1
    Bloom filters could be a good follow-up to this.
  • @jfftck
    I watched a talk on Python dictionaries; the person who worked on the new implementation went into detail about how they are more closely related to databases than to classic hash maps. It was done to increase performance, and since almost everything in Python has a backing dictionary, it made a large difference in runtime.
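
That "backing dictionary" is easy to poke at; ordinary Python objects store their attributes in one:

    class Point:
        def __init__(self, x, y):
            self.x = x
            self.y = y

    p = Point(1, 2)
    print(p.__dict__)     # {'x': 1, 'y': 2} -- attributes live in a dict
    p.__dict__["z"] = 3   # writing to the dict adds an attribute
    print(p.z)            # 3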
  • @exchange4918
    Thanks for your videos Mike keep it rolling 🎉
  • @jonny__b
    Oftentimes on modern hardware, particularly on small datasets, a linear scan can be faster than a hashmap lookup because the hashing function is slow.
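
This is easy to measure rather than guess at; a quick comparison sketch (results depend on the machine, the key type, and the collection size):

    import timeit

    items = ["alpha", "beta", "gamma", "delta"]  # a tiny collection
    as_list, as_set = list(items), set(items)

    t_list = timeit.timeit(lambda: "delta" in as_list, number=1_000_000)
    t_set = timeit.timeit(lambda: "delta" in as_set, number=1_000_000)
    print(f"list scan: {t_list:.3f}s    set lookup: {t_set:.3f}s")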
  • @ibrahimmahrir
    12:02 In the __contains__ function there shouldn't be an else: before the return False at the end; otherwise, when the list self._data[idx] is not None but the item is not in that list, the function won't return a boolean.
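
The page doesn't reproduce the video's code, but the fix described above looks like this (a reconstruction; the point is that return False must be reachable both when the bucket is empty and when it exists but lacks the item):

    class HashSet:
        def __init__(self, size=16):
            self._data = [None] * size  # buckets start out as None

        def add(self, item):
            idx = hash(item) % len(self._data)
            if self._data[idx] is None:
                self._data[idx] = []
            if item not in self._data[idx]:
                self._data[idx].append(item)

        def __contains__(self, item):
            idx = hash(item) % len(self._data)
            if self._data[idx] is not None and item in self._data[idx]:
                return True
            return False  # no stray else: this also covers a non-empty
                          # bucket that doesn't contain the item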
  • @KipIngram
    It's important to point out here that probability of collision can be reduced by increasing table size, but then your "utilization" of your table space will be lower. It's absolutely a trade-off and you can't improve one without degrading the other. As your table fills up, collisions will become more and more common. For example, if your table is 80% full, then almost by definition there's an 80% chance of a new item colliding with an existing one. The uniformity of the hash function pretty much guarantees it. There's a lot of nice probability theory analysis you can do around these things. Of course, that 80% full table gives you a 64% chance of colliding on your first two tries, a 51.2% chance of failing on the third probe, a 41% chance of failing on the fourth, and so on. The average length of the chains you wind up with goes up sharply as you push up utilization.
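
The arithmetic behind those figures, spelled out (independent uniform probes at a fixed load factor):

    load = 0.8  # table is 80% full
    for k in range(1, 5):
        # probability that the first k probes all hit occupied slots
        print(f"all of the first {k} probes collide: {load**k:.1%}")
    # prints 80.0%, 64.0%, 51.2%, 41.0% -- the figures quoted above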
  • @cinderwolf32
    The topics discussed on this channel have been the ones that really specifically interest me as of late. This is cool, thank you!
  • @Loki-
    I'm not new to programming, but I'm new to Python and I was just literally looking into what uses hash tables in Python. Thanks. Lol