Interesting data set. I am building a new kind of data analysis tool (https://ww...

screature2 · on Oct 14, 2022

I think d41d8cd98f00b204e9800998ecf8427e is the MD5 sum of "nothing", i.e. an MD5 reverse lookup will get you a zero-length stream of characters.

Granted, this is entirely without looking at the data, but my guess is that they MD5-hashed whatever was in that apartment_id column, and if it was empty it spat out d41d8cd98f00b204e9800998ecf8427e.
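
A quick way to check that guess, as a minimal sketch in Python. The sample values in `apartment_ids` are made up for illustration and are not taken from the dataset; the empty string stands in for a missing id.

```python
import hashlib

def md5_hex(value: str) -> str:
    """Return the hex MD5 digest of a UTF-8 encoded string."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()

# Hypothetical column values; "" stands in for a missing apartment id.
apartment_ids = ["A-101", "", "B-202", ""]
hashed = [md5_hex(v) for v in apartment_ids]

# Every empty input collapses to the same well-known digest.
EMPTY_MD5 = "d41d8cd98f00b204e9800998ecf8427e"
print([h == EMPTY_MD5 for h in hashed])
```

If the column was hashed this way, every missing id ends up as the same value, which would explain why that one hash shows up so often.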

tleilaxu · on Oct 14, 2022

>Ah yes, I recognise that particular md5 hash value from memory.

Excellent, I certainly know I am reading HN.

Quarrel · on Oct 14, 2022

Yep:

    ~  touch null
    ~  md5sum null
    d41d8cd98f00b204e9800998ecf8427e  null

didgetmaster · on Oct 14, 2022

Wow! I have memorized some bizarre things before, but I don't think I ever went that far to recognize an md5 hash.

joshspankit · on Oct 15, 2022

If you’re old enough, you might have memorized “the” Windows XP key

2Gkashmiri · on Oct 15, 2022

H7C97-C67JB-G6RQR-P6H2Y-TMQ6W

literalAardvark · on Oct 15, 2022

I expect he meant FCKGW-

But good showing nonetheless.

2Gkashmiri · on Oct 15, 2022

this is actually XP sp2 so yeah

joshspankit · on Oct 15, 2022

Suggestion: add “is this null?”/“is this common?” to your analysis tool. It might take determining what the hashing method is for each dataset “column”, but this kind of trap is everywhere and your users would probably be delighted when they see that it’s already identified.
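
A rough sketch of what such an “is this null?” check could look like, in Python. The function name `null_hash_report` and the choice of algorithms are assumptions for illustration, not anything from the tool itself. Instead of guessing the hashing method up front, it compares each value against the empty-input digest of a few common hash functions.

```python
import hashlib
from collections import Counter

# Digests of zero-length input, computed rather than hard-coded.
# The list of algorithms is an assumption; extend it as needed.
EMPTY_DIGESTS = {
    hashlib.new(name, b"").hexdigest(): name
    for name in ("md5", "sha1", "sha256")
}

def null_hash_report(column):
    """Count values that equal the hash of empty input, per algorithm."""
    counts = Counter()
    for value in column:
        # Normalise case and whitespace before the lookup.
        algo = EMPTY_DIGESTS.get(str(value).strip().lower())
        if algo is not None:
            counts[algo] += 1
    return dict(counts)
```

For a column holding d41d8cd98f00b204e9800998ecf8427e a few times, the report would attribute those rows to "md5", which is a strong hint that the original field was empty.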

mfranzen · on Oct 15, 2022

Co-author here, thanks a lot for your feedback! We will make sure to clarify the apartment_id field in the next version. The apartment id is actually the unique id for an apartment within a single site (note: an apartment can span multiple floors), but it is not necessarily unique across the whole dataset.

We will also include a plan_id field that makes it possible to identify which floors of a building are repeated (the apartment_ids already differ, though).
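
For anyone who wants to check the uniqueness point on their own copy, here is a minimal sketch in Python. It assumes each row carries a site identifier next to apartment_id; the column name `site_id` and the sample rows are hypothetical, so adjust them to the actual schema.

```python
from collections import defaultdict

def sites_per_apartment_id(rows):
    """Map each apartment_id to the set of sites it appears in."""
    sites = defaultdict(set)
    for row in rows:
        sites[row["apartment_id"]].add(row["site_id"])
    return sites

# Made-up rows: one apartment spanning two floors, and one id reused across sites.
rows = [
    {"site_id": 1, "apartment_id": "abc", "floor": 0},
    {"site_id": 1, "apartment_id": "abc", "floor": 1},
    {"site_id": 2, "apartment_id": "abc", "floor": 0},
    {"site_id": 2, "apartment_id": "def", "floor": 0},
]

# Ids seen in more than one site need the site as part of the key.
ambiguous = {k: v for k, v in sites_per_apartment_id(rows).items() if len(v) > 1}
print(ambiguous)
```

Any id that lands in `ambiguous` only identifies an apartment when it is paired with its site.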