
How can you download the entire dataset?


You can download the entire dataset using curl (will be 40+ GB)

    curl -s --retry 10 --retry-all-errors --remote-name-all --parallel --parallel-max 150 "https://api.pwnedpasswords.com/range/{0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F}{0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F}{0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F}{0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F}{0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F}"
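For anyone wondering how many requests that glob expands to: a rough Python sketch, assuming only the base URL shown in the curl command above. The names `HEX_DIGITS`, `BASE_URL` and `range_urls` are made up for illustration. It enumerates every 5-character hex prefix without downloading anything.

```python
from itertools import product

HEX_DIGITS = "0123456789ABCDEF"
# Base URL copied from the curl command above.
BASE_URL = "https://api.pwnedpasswords.com/range/"


def range_urls():
    """Yield one URL per 5-character uppercase hex prefix."""
    for combo in product(HEX_DIGITS, repeat=5):
        yield BASE_URL + "".join(combo)


urls = list(range_urls())
print(len(urls))   # 16**5 = 1048576 range files
print(urls[0])     # .../range/00000
print(urls[-1])    # .../range/FFFFF
```

So the one-liner is roughly a million small GET requests, which is why `--parallel` matters so much here.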


It's not that I couldn't have written that oneliner, it's that I assumed you'd get blocked very quickly.


It is officially recommended by Troy Hunt: https://github.com/HaveIBeenPwned/PwnedPasswordsDownloader/i...


That speaks to a certain confidence in one's servers' ability to hold up under load, doesn't it?

"Oh you want your own copy? Sure, just thrash seven shades of shit out of the database. Here's how."


It's not a database, it's just files. And they are hosted by Cloudflare so they can cope with a lot of downloads.

I think he should make the files smaller by removing the second half of each hash, i.e. reducing it from 40 hex digits to 20. This increases the chance of a false positive (i.e. I enter my password and it says it was compromised when it wasn't; it just shares a truncated hash with one that was) from about 1 in 10^48 to about 1 in 10^24 (per password), but that's still a huge number. (There are fewer than 10^10 people in the world, and they only have a few passwords each.) This would approximately halve the download, maybe more, because the first half of each hash is more compressible (when sorted) while the second half is totally random.
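To check the arithmetic, here is a quick Python sketch. The helper names are hypothetical, and `truncate_entry` assumes each downloaded line looks like `SUFFIX:COUNT`, with the 5-character prefix coming from the file name; treat that layout as an assumption, not a spec.

```python
import math


def hash_space(hex_digits: int) -> int:
    """Number of distinct values representable in `hex_digits` hex characters."""
    return 16 ** hex_digits


def truncate_entry(line: str, prefix: str, keep: int = 20) -> str:
    """Rebuild the full hash from prefix + suffix, then keep the first `keep` digits.

    Assumes the line has the form 'SUFFIX:COUNT' (an assumption about the file layout).
    """
    suffix, count = line.strip().split(":")
    full_hash = prefix + suffix
    return f"{full_hash[:keep]}:{count}"


# 40 hex digits is about 10^48 values, 20 hex digits is about 10^24.
print(round(math.log10(hash_space(40)), 2))
print(round(math.log10(hash_space(20)), 2))

# Hypothetical example line for the prefix 21BD1.
print(truncate_entry("0018A45C4D1DEF81644B54AB7F969B88D65:3", "21BD1"))
```

Note that two different full hashes could truncate to the same 20 digits, so a real tool would want to merge duplicate entries after truncating; this sketch doesn't.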


> It's not a database, it's just files. And they are hosted by Cloudflare so they can cope with a lot of downloads.

Database: a usually large collection of data organized especially for rapid search and retrieval (as by a computer) [1]

It is a database. Stop nitpicking.

[1] https://www.merriam-webster.com/dictionary/database


Confidence in Cloudflare, for sure.


That's crazy, thank you.


You are being purposefully obtuse here. HIBP is a very, very well established site with a long history of operating in good faith.


> > It's not that I couldn't have written that oneliner, it's that I assumed you'd get blocked very quickly.

> junon https://news.ycombinator.com/user?id=junon

> You are being purposefully obtuse here. HIBP is a very, very well established site with a long history of operating in good faith.

Allowing people to query is one thing; someone downloading the entire dataset is normally considered abuse, so being blocked is the expectation here. You're so dense you're bending light around you.


Several open source tools can be found on GitHub, but here’s the “official” one https://github.com/HaveIBeenPwned/PwnedPasswordsDownloader



