The "Technical and economic passports of apartment buildings" dataset is a table in which one of the columns holds addresses. To use this data, the table has to be geocoded: a point with coordinates should be assigned to each record. I wrote a simple Python script that calls an online geocoding service and adds longitude and latitude columns to the table.
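A minimal sketch of what such a script could look like, not the actual one. The `geocode` callable stands in for whatever online geocoding service is used, and the `address` field name, the `fake_geocode` helper and its coordinates are made-up placeholders for illustration.

```python
def add_coordinates(rows, geocode, address_field="address"):
    """Return copies of the rows with 'longitude' and 'latitude' keys added.

    `geocode` is any callable that takes an address string and returns
    a (longitude, latitude) tuple, or None when nothing was found.
    """
    cache = {}  # the same address can occur many times; ask the service once
    geocoded = []
    for row in rows:
        address = row[address_field]
        if address not in cache:
            cache[address] = geocode(address)
        lon, lat = cache[address] or (None, None)
        geocoded.append({**row, "longitude": lon, "latitude": lat})
    return geocoded


def fake_geocode(address):
    """Hypothetical stand-in for an online service, with placeholder values."""
    known = {"Nevsky pr., 1": (30.0, 60.0)}
    return known.get(address)


table = [{"address": "Nevsky pr., 1"}, {"address": "no such street"}]
result = add_coordinates(table, fake_geocode)
```

Records the service cannot resolve keep empty coordinates, so they can be reviewed by hand later instead of being dropped silently.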
CityWalls.ru webpages have a simple structure, which means the data can be collected easily.
Let's write a simple parser in Python. It collects the URL of the house webpage, the CityWalls ID, a text string with the year(s) built, the architectural style, the architect's name, the building name, the address and a link to the photo. The coordinates from the site cannot be used: users often put the pin near the house, not on it. That's why I geocode this layer myself with the script we already have. The outcome of this stage is geographical points with attributes for 27 thousand buildings.
Points from "Passports" and CityWalls are assigned to building polygons with a simple rule: "the point should be on the polygon". Now every polygon has a table with a bunch of information from different sources:
- "Object-address system of Saint-Petersburg"
- "Technical and economic passports of apartment buildings"
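The "point should be on the polygon" rule can be sketched in plain Python with a ray-casting test. This is only an illustration, assuming each building is a single outer ring of (x, y) vertices without holes; in practice a GIS tool or a spatial join would do the same job. The names `point_in_polygon` and `assign_points` are hypothetical.

```python
def point_in_polygon(x, y, ring):
    """Ray casting: count how many edges a ray going right from (x, y) crosses."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # The edge can only be crossed if its ends lie on different sides of y.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_points(points, polygons):
    """Attach each point's attributes to the first polygon that contains it.

    points:   list of (x, y, attributes) tuples
    polygons: dict mapping a polygon id to its ring of (x, y) vertices
    """
    assigned = {polygon_id: [] for polygon_id in polygons}
    for x, y, attributes in points:
        for polygon_id, ring in polygons.items():
            if point_in_polygon(x, y, ring):
                assigned[polygon_id].append(attributes)
                break
    return assigned


buildings = {"house_1": [(0, 0), (2, 0), (2, 2), (0, 2)]}
records = [(1, 1, {"source": "Passports"}), (3, 1, {"source": "CityWalls"})]
joined = assign_points(records, buildings)
```

Each polygon ends up with a list of records, one per matching point, which is where the mixed attributes from the different sources come from.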
Such a mess! The goal of the current stage is to get good-quality attributes for each building:
- year built
- link to CityWalls page
- link to photo
We determine the priority of sources for each attribute. For example, address string quality is best in "Object-address system", good in "Passports", and about equal in CityWalls and OSM entries.
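One way to apply such a priority is to walk the sources in order and keep the first non-empty value. The sketch below assumes each building's candidates are stored in a dict keyed by source; the keys `oas`, `passports`, `citywalls` and `osm` and the sample addresses are hypothetical.

```python
# Sources ordered from most to least trusted for the address attribute.
# CityWalls and OSM are of equal quality, so their relative order is arbitrary.
ADDRESS_PRIORITY = ["oas", "passports", "citywalls", "osm"]


def pick_by_priority(candidates, priority):
    """Return the first non-empty value, following the source priority."""
    for source in priority:
        value = candidates.get(source)
        if value is not None and str(value).strip():
            return value
    return None


candidates = {
    "oas": "",  # this source has no address for the building
    "passports": "Nevsky pr., 1",
    "osm": "Nevsky prospect 1",
}
address = pick_by_priority(candidates, ADDRESS_PRIORITY)
```

A separate priority list can be kept for every attribute, since the best source for addresses is not necessarily the best one for the year built.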
Date mining is a special kind of fun: for data analysis we need an integer, but in all sources the date is a string in very diverse formats. It could be the easy option "1703" or "1703г", a list of years "1703,2020", a building period "1822-1917", an epoch "before 1822", or something outstanding like "1 9471 94 8", plus their combinations and exceptions.
After playing with the data a bit and making the wrong decision a couple of times, I found a rule that gives an acceptable result: take the first four-digit number in the string, unless the string is a period, in which case take the second four-digit number. After a bit of pain with regular expression magic, one line of code turns all the text fields into numbers.
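The rule can be sketched as follows. This is not the original one-liner, just a readable version of the same idea. It assumes that a "period" means two four-digit numbers joined by a hyphen or an en dash, and the name `extract_year` is hypothetical.

```python
import re

# A four-digit number that is not part of a longer run of digits.
YEAR = re.compile(r"(?<!\d)\d{4}(?!\d)")
# Assumption: a period is two such numbers joined by a hyphen or an en dash.
PERIOD = re.compile(r"(?<!\d)\d{4}\s*[-\u2013]\s*\d{4}(?!\d)")


def extract_year(text):
    """Return the year as an int, or None if no four-digit number is found."""
    years = YEAR.findall(text)
    if not years:
        return None
    if PERIOD.search(text) and len(years) >= 2:
        return int(years[1])  # for a period, take the second number
    return int(years[0])      # otherwise, take the first number


print(extract_year("1703"))         # 1703
print(extract_year("1703г"))        # 1703
print(extract_year("1703,2020"))    # 1703
print(extract_year("1822-1917"))    # 1917
print(extract_year("before 1822"))  # 1822
```

Garbled strings such as "1 9471 94 8" are not repaired by this sketch and would need separate cleanup before the rule is applied.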