Interesting data set. I am building a new kind of data analysis tool (https://www.Didgets.com), so I am always looking for good open data sets to download and import, both to see what the data shows and to test out my tool.
I downloaded both CSV files (geometry and simulations) and built a couple relational tables with them in a few minutes. I am confused by a few things. There are 42,207 unique values in the 'apartment_id' column. The most common one is d41d8cd98f00b204e9800998ecf8427e which is referenced 1451 times. At first I thought that it might actually be some kind of 'plan_id' where the same plan was used to build multiple apartments (this id is associated with 13 different 'building_id' values) but drilling down to each one reveals some very different features.
It is certainly possible that the same plan could be used with slight variations (e.g. one has a tub in the bathroom while another has a shower installed), but some of the features differ too much for that. For example, there are 26 different KITCHEN areas associated with the id, but only 21 LIVING_DINING areas.
My tool is great for finding and fixing anomalies in data sets if they exist. This one is a bit confusing about what some elements mean and the site doesn't explain them very well.
If the same plan is being used across multiple buildings, it might be interesting to see how the amount of light entering an apartment differs depending on whether that plan was built on the north side of a building or the south side.
I think d41d8cd98f00b204e9800998ecf8427e is the MD5 sum of "nothing", i.e. it is what you get when you MD5-hash a zero-length input.
Granted, this is entirely without looking at the data, but my guess is that they MD5-hashed whatever was in that apartment_id column, and whenever it was empty it spat out d41d8cd98f00b204e9800998ecf8427e.
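For anyone who wants to check the claim, here is a minimal sketch using only Python's standard library. It hashes an empty byte string and prints the digest; the constant name `EMPTY_MD5` is just a label chosen for this example.

```python
import hashlib

# MD5 of a zero-length input, rendered as lowercase hex.
EMPTY_MD5 = hashlib.md5(b"").hexdigest()

print(EMPTY_MD5)  # d41d8cd98f00b204e9800998ecf8427e
```

This only confirms that the value is the MD5 of empty input. Whether the dataset authors actually hashed an empty apartment_id is still a guess.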
Suggestion: add “is this null?”/“is this common?” checks to your analysis tool. It might require working out what the hashing method is for each dataset “column”, but this kind of trap is everywhere and your users would probably be delighted to see that it has already been identified.
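A rough sketch of what such a check could look like, assuming the column is available as a list of strings. The function name `find_empty_hash_values` and the choice of algorithms (MD5, SHA-1, SHA-256) are assumptions made for illustration, not anything taken from the tool or the dataset.

```python
import hashlib
from collections import Counter

# Hex digests of the empty input for a few common algorithms (illustrative choice).
EMPTY_DIGESTS = {
    hashlib.new(name, b"").hexdigest(): name
    for name in ("md5", "sha1", "sha256")
}

def find_empty_hash_values(values):
    """Return {digest: (algorithm, count)} for values that equal the hash of empty input."""
    # Non-string entries (e.g. None) are skipped; case and whitespace are normalized.
    counts = Counter(v.strip().lower() for v in values if isinstance(v, str))
    return {
        digest: (EMPTY_DIGESTS[digest], n)
        for digest, n in counts.items()
        if digest in EMPTY_DIGESTS
    }

# Hypothetical column contents, not real rows from the dataset.
column = [
    "d41d8cd98f00b204e9800998ecf8427e",
    "3c3b274d119ff5a5ec6c1e215c1cb794",
    "D41D8CD98F00B204E9800998ECF8427E",
    None,
]
print(find_empty_hash_values(column))
```

A hit here would only be a hint that the source value was probably empty before hashing; salted or truncated hashes would not be caught by this kind of lookup.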
Co-author here, thanks a lot for your feedback! We will make sure to clarify the apartment_id field in the next version. The apartment id is actually the unique id for an apartment within a single site (note: an apartment sometimes spans multiple floors), but it is not necessarily unique over the whole dataset.
We will also include a plan_id field that makes it possible to identify which floors of a building are repeated (the apartment_ids already differ, though).
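If apartment_id is only unique within a site, one way to see the effect is to list the ids that show up under more than one site. The sketch below assumes rows are available as dictionaries and that a site identifier column exists; the column name `site_id` and the function name are hypothetical, since the thread only mentions 'apartment_id' and 'building_id'.

```python
from collections import defaultdict

def apartment_ids_shared_across_sites(rows):
    """Return {apartment_id: set_of_site_ids} for ids that appear in more than one site."""
    sites = defaultdict(set)
    for row in rows:
        # "site_id" is an assumed column name; substitute whatever the dataset uses.
        sites[row["apartment_id"]].add(row["site_id"])
    return {apt: s for apt, s in sites.items() if len(s) > 1}

# Hypothetical rows, not taken from the dataset.
rows = [
    {"site_id": 1, "apartment_id": "a"},
    {"site_id": 2, "apartment_id": "a"},
    {"site_id": 1, "apartment_id": "b"},
]
print(apartment_ids_shared_across_sites(rows))  # {'a': {1, 2}}
```

Any id returned here would need the pair (site_id, apartment_id) rather than apartment_id alone to identify a single apartment.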