Beep:
In this thesis we present three methods for postal address extraction: rule-based, machine-learning and hybrid. Our machine-learning based system combines different sources of weak evidence and uses the word n-gram model as the underlying representation of web pages. The extraction process is fast and accurate despite not using dictionary look-up. It outperforms both rule-based systems that we built. The accuracy of the system is further improved, with precision of 95.2% and recall of 81.1%.
Wait, where am I? I get to look over the shoulder of someone who is busy mastering this knowledge:
The problem of high-performance address extraction is closely related to the areas of
Named Entity Extraction and Geoparsing/Geocoding.
Let's recapitulate anyway. Addresses? No, those we already have.1 2 It is one component of these addresses, the street name, that you want to know about: where does that name come from, who or what was that?
Indeed, the question is whether you could have computers figure this out; there are a lot of names. And it simplifies the approach that, for different places across the country, everyone often ended up with the same names anyway.
All the same, something to be jealous of:
This sounds like a problem to be solved with bidirectional LSTM classification. You tag each character of the sample as one category, for example:
street: 1 city: 2 province: 3 postcode: 4 country: 5
1600 Pennsylvania Ave, Washington, DC 20500 USA
111111111111111111111, 2222222222, 33 44444 555
Now, train your classifier based on these labels. Boom!
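To make the labelling step in that answer concrete, here is a minimal sketch of how the per-character tags could be generated from an address whose fields are already known. It only prepares training labels; it does not train a bidirectional LSTM. The function names `tag_characters` and `render`, and the use of 0 for untagged separator characters, are assumptions made for this illustration and do not come from the quoted answer.

```python
# Label numbers as given in the quoted answer.
LABELS = {"street": 1, "city": 2, "province": 3, "postcode": 4, "country": 5}


def tag_characters(text, fields):
    """Return one integer label per character of `text`.

    `fields` is a list of (label name, substring) pairs in the order they
    occur in `text`. Characters outside every field get 0 (an assumption
    of this sketch, used for separators such as ", ").
    """
    tags = [0] * len(text)
    cursor = 0
    for name, value in fields:
        start = text.index(value, cursor)  # raises ValueError if a field is missing
        end = start + len(value)
        tags[start:end] = [LABELS[name]] * len(value)
        cursor = end
    return tags


def render(text, tags):
    """Show tagged characters as their label digit and untagged ones as-is."""
    return "".join(str(tag) if tag else ch for ch, tag in zip(text, tags))


sample = "1600 Pennsylvania Ave, Washington, DC 20500 USA"
fields = [
    ("street", "1600 Pennsylvania Ave"),
    ("city", "Washington"),
    ("province", "DC"),
    ("postcode", "20500"),
    ("country", "USA"),
]
tags = tag_characters(sample, fields)
print(sample)
print(render(sample, tags))
```

The second printed line reproduces the row of digits from the quoted example; the `tags` list is the form a sequence classifier would be trained on.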
However, there are a number of factors that make highly accurate address extraction difficult:
The same name may be written in different formats because of individual preference or other considerations.
We found that out the hard way. PTT, away with it. A maximum of 17 positions for the field length of street names. This for the benefit of the smallest address carrier known at the time, the Cheshire label, 5 per row:
V T V KL POORTJE
V W VD GRACHTSTR
F MEERBURG SR KD
Not the faintest idea?
Background and Related Work:
Named Entity Recognition (NER) is the task of extracting entities, such as proper names (person, location and organization), time (date and time), and numerical values (currency and percentage), from the source text, and mapping them into predefined categories, such as person, organization, location name or “none-of-the-above”. It is a part of the Information Extraction research domain aimed at extracting various information from unstructured text-based data sources.
NER:
The Knowledge Graph Search API lets you find entities in the Google Knowledge Graph:
- Getting a ranked list of the most notable entities that match certain criteria.
- Predictively completing entities in a search box.
- Annotating/organizing content using the Knowledge Graph entities.
Typical use case:
https://kgsearch.googleapis.com/v1/entities:search?query=Pennsylvania%20Ave
Mommy, door open! And I want the light to stay on:
The White House is the official residence and workplace of the President of the United States. It is located at 1600 Pennsylvania Avenue NW in Washington, D.C.
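The request URL from the typical use case can be assembled with the standard library, as in the rough sketch below. The function name `build_search_url` is hypothetical. The optional `key` and `limit` parameters are taken from the public Knowledge Graph Search API documentation as I understand it, not from the text above; real requests normally need an API key. The sketch only builds the URL and makes no network call.

```python
from urllib.parse import quote, urlencode

# Endpoint as it appears in the use case above.
BASE_URL = "https://kgsearch.googleapis.com/v1/entities:search"


def build_search_url(query, api_key=None, limit=None):
    """Build an entities:search URL (hypothetical helper for illustration).

    `api_key` and `limit` map to the `key` and `limit` query parameters,
    which are assumptions based on the public API documentation.
    """
    params = {"query": query}
    if api_key is not None:
        params["key"] = api_key
    if limit is not None:
        params["limit"] = limit
    # quote_via=quote encodes spaces as %20 instead of the default "+".
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


url = build_search_url("Pennsylvania Ave")
print(url)
```

Called with only the query, this prints the same URL as the use case; passing `api_key="YOUR_API_KEY"` and `limit=1` appends `&key=YOUR_API_KEY&limit=1`.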
Effective? We demonstrated that the machine-learning approach to address extraction is more affective:
Special thanks to my wife, who has played a critical role in the completion of this thesis: without her love and encouragement, this thesis would never have been possible.
1 complying with European agreements, but to be on the safe side publishing somewhat later than the paid version from the Kadaster: – http://geodata.nationaalgeoregister.nl/inspireadressen/atom/inspireadressen.xml
2 for that money the Kadaster does not tell you exactly where a street is located; Rijkswaterstaat does: – http://geodata.nationaalgeoregister.nl/nwbwegen/atom/nwbwegen_wegvakken.xml