
Is ChatGPT a game changer for geocoding – a benchmark for geocoding address parsing techniques

October 22, 2023

Zhengcong Yin

Texas A&M University College Station, TX, USA

Diya Li

Texas A&M University College Station, TX, USA

Daniel W. Goldberg

Texas A&M University College Station, TX, USA

ABSTRACT

The remarkable success of GPT models across various tasks, including toponymy recognition, motivates us to assess the performance of the GPT-3 model on the geocoding address parsing task. To ensure that the evaluation more accurately mirrors performance in real-world scenarios with diverse user input qualities, and to address the pressing need for a 'gold standard' evaluation dataset for geocoding systems, we introduce a benchmark dataset of low-quality address descriptions synthesized from human input patterns mined from actual input logs of a geocoding system in production. This dataset covers 21 different input errors and variations; contains over 239,000 address records uniquely selected from streets across all 50 U.S. states and D.C.; and consists of three subsets to be used as training, validation, and testing sets. Building on this, we train and gauge the performance of the GPT-3 model in extracting address components, contrasting its performance with transformer-based and LSTM-based models. The evaluation results indicate that the Bidirectional LSTM-CRF model achieves the best performance among the evaluated models, with the transformer-based models attaining very comparable results. The GPT-3 model, though trailing in performance, shows potential in the address parsing task with few-shot examples, exhibiting room for improvement with additional fine-tuning. We open source the code and data of the presented benchmark so that researchers can utilize it for future model development or extend it to evaluate similar tasks, such as document geocoding.

CCS CONCEPTS
  • Information systems → Location-based services; • Computing methodologies → Natural language processing;
KEYWORDS

Geocoding, Address parsing, Benchmark, NER, LLM, GPT

1        INTRODUCTION

Geocoding is the process of converting address descriptions into geographic coordinates [13] and has been widely used as a data-processing step to enable spatial analysis in various domains, from efficient urban planning to advancing public health [22, 29, 30, 41, 43]. However, the validity of conclusions drawn by studies that employ geocoding as part of their workflow can be largely impacted by the quality of the geocoded data [19, 32, 40]. Although every step in a geocoding process can accumulate errors in the final output [8, 13], address parsing, which extracts address components (shown in Figure 1) from the address description input by users, plays a profound role in determining the quality of geocoded data. This is because outputs from the address parsing process are used to assemble query strings that retrieve matching candidates for further calculation and ranking to derive final geocoded outputs [8, 9, 39]. For example, if an address parser mistakenly recognizes '116 S' as the house number, the geocoding engine cannot retrieve the correct geocode output from the reference dataset when employing '116 S' as the house number search criterion. Moreover, geocoding input is proven to be error-prone, containing syntactic or semantic errors [3, 18]; such input quality demands that an address parser handle these errors appropriately to ensure the quality of the geocoded output.

Recently, significant breakthroughs have been witnessed in Large Language Models (LLMs) and their applications in Natural Language Processing (NLP). Models like GPT-3 [1] are making waves by setting new benchmarks in various tasks [36] and by shifting the model training workflow [7]. Researchers in the geospatial domain have also evaluated the capability of Generative Pre-trained Transformer (GPT) models in handling toponymy recognition and location description recognition tasks and have shown promising results [27]. This raises the question: can GPT-based models make a difference in the task of address parsing?

Yet, there has been minimal effort in evaluating the performance of GPT-based models for address parsing or geocoding, and the favorable evaluation outcomes presented in [27] cannot provide many insights into the performance of GPT-based models on geocoding address parsing: their evaluation framework [16] is not designed for the address geocoding task, and the input for toponymy resolution tasks, which contains place names in short text messages, differs from the input for geocoding, a postal address description. In fact, in the realm of geocoding, a "gold-standard" benchmark dataset that can fully evaluate geocoding systems is in high demand [19]. Compared to the magnitude of human input errors, the input datasets in existing geocoding evaluation frameworks contain only relatively simple misspellings [3, 18]. We argue that the evaluation data used in existing work does not fully reflect the quality of geocoding input in real scenarios, leaving the performance of geocoding systems and address parsers uncertain when facing (erroneous) input in reality.

Therefore, in this work, we present a benchmark specifically designed to evaluate geocoding address parsing techniques using low-quality input synthesized by mining human input patterns from real geocoding system logs, and we evaluate the address parsing performance of the GPT-3 model and compare it with transformer-based and recurrent neural network-based address parsing methods.

Figure 1: USPS standard address components of a postal address description

The contributions of this work can be summarized as follows.

  • A benchmark dataset that contains diverse address descriptions (e.g., highway and grid style) covering all U.S. states and 21 input errors and variations is generated by mining real geocoding system logs. A data processing pipeline is developed to analyze input errors and variations occurring in different address components from real user input, and then to inject errors and variations following these harvested patterns to synthesize low-quality geocoding input. To the best of our knowledge, this is the first publicly released annotated low-quality geocoding input dataset for U.S. addresses with such magnitude of coverage and errors/variations.
  • Address parsers built upon five different models (i.e., the GPT-3 model, three transformer-based models, and an LSTM-based model) are evaluated on synthesized low-quality address input with different errors to reflect their performance when facing various input qualities in real scenarios. These evaluation results can provide insights into the potential capabilities of each model, especially the GPT-3 model, for further fine-tuning or enhancement.
  • The proposed benchmark, encompassing the benchmark datasets and address parsing methods, is available as open source and can be accessed on GitHub. Researchers could use the benchmark dataset for other geospatial text processing tasks or use the evaluation results as baselines for future development and experimental comparisons. This proposed framework can be extended to synthesize language- or country-specific low-quality input to evaluate address parsing or geocoding systems in different countries.

The remainder of this paper is organized as follows. Section 2 summarizes recent work on address parsing techniques in geocoding and Named Entity Recognition in other domains. Section 3 describes the design details of the proposed benchmark, including the approach to synthesizing the low-quality geocoding input, the evaluated address parsing techniques, and the evaluation metrics. In Section 4, we illustrate the evaluation outcomes and discuss the results. We conclude this paper with potential avenues for future work in Section 5.

2        RELATED WORK

Geocoding address parsing is a domain-specific Named Entity Recognition (NER) task that has received extensive research attention. In previous research endeavors, the primary effort has centered on algorithms designed to improve address parsing capabilities. In the initial stage, these parsing algorithms were predominantly built on rule-based and statistical methodologies. Rule-based approaches usually leverage the format of the local address schema and its hierarchy to determine the sequence of labels for a given address input [11]. Typically, a trie or tree-based data structure is used to mimic the hierarchy of the address system, while string matching (i.e., forward/backward string methods), beam search, heuristic search strategies, and finite-state machines are used to explore the possible label sequences for addresses. Given that rule-based methods heavily rely on address system rules and lexicons to recognize certain address components (e.g., road types), variation in user input in terms of quality and description can easily result in the "Out Of Vocabulary" issue. Statistical address parsing, by contrast, is a learning and tagging process: an annotated corpus is required for training, and sequence tagging algorithms make decisions for each label. Two popular models, Hidden Markov Models (HMMs) and Conditional Random Fields (CRFs), have been used to build address parsers [2, 34] and achieved state-of-the-art performance at the time. To augment the coverage of the state transition matrix for variations, [2] enhanced the training data to contain intentionally manipulated addresses. Later, hybrid address parsers that combine the rule-based and statistical approaches [24] showed better parsing performance. In recent years, research has shifted towards using neural networks and LLMs as the foundational framework for building address parsers [14, 15, 28, 31, 37], given their proven success in NER tasks across various domains [21]. Another avenue of research related to address parsing involves reducing the need for annotated data [4] or predicting noisy tokens in geocoding queries [35].

Given that address descriptions and formats differ among countries [37], the aforementioned studies use input address descriptions specific to the address systems of their study areas, including the U.S. [9], China [23], Japan [25], and India [28]. However, the lack of a standardized evaluation dataset for each individual address system complicates the direct comparison of experimental results across studies targeting the same country. Our work extends the existing work by presenting a unified evaluation framework, including a benchmark dataset, evaluation procedures, and evaluation metrics created specifically to assess geocoding address parsing. The benchmark dataset, which accounts for the heterogeneity in address formats and encompasses a wide range of input errors/variations, is publicly released to facilitate future research investigations.

3        BENCHMARK DESIGNS

This section describes how the benchmark datasets are generated, the selection of evaluated models, and the metrics used to assess address parsing performance.

3.1        Benchmark Dataset

Figure 2 depicts the workflow used to generate the benchmark dataset, namely, the low-quality geocoding input dataset. This workflow contains three major steps: (1) extracting the ground-truth dataset; (2) building an address component error injector that can generate common geocoding input errors and variations; and (3) synthesizing the low-quality geocoding input.

Figure 2: Benchmark dataset processing workflow

3.1.1 Ground-truth data. The ground-truth data is generated by extraction from reference datasets, as reference datasets are the single source of truth that geocoding systems query during the retrieval process to derive final outputs. We extracted address descriptions from the Navteq 2016 address point reference dataset used by the Texas A&M geocoding platform, because this dataset has been utilized by other studies [38, 39], and every address description in this reference dataset has already been segmented and aligned with the USPS standard address component labels shown in Figure 1. To ensure the diversity of address descriptions across the U.S., we first extract the unique combinations of address components other than house number (i.e., street name, predirectional, postdirectional, city name, and postal code) from each U.S. state and the District of Columbia; that is, we obtained one address description from every street across all 50 U.S. states and D.C. to form a unique address description dataset. Then, we further split this unique address dataset into three smaller datasets designated for the training, validation, and testing procedures in this benchmark. The testing dataset is generated by extracting one address description from every (state, postal code) pair in the U.S., resulting in a dataset of 30,622 addresses. To obtain the training and validation datasets, we first exclude the testing dataset from the unique address collection; we then randomly select up to 9 addresses from every unique combination of city, state, and postal code in the remaining collection and put the first two addresses and the last address into the training and validation datasets, respectively, when applicable. In the end, the training and validation datasets contain 148,173 and 60,522 addresses, respectively, and all address descriptions in the training, validation, and test datasets are mutually exclusive.
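A minimal sketch of this splitting logic is shown below, assuming a pandas DataFrame with one row per unique address description; the column names ('city', 'state', 'zip') are illustrative placeholders, not the actual schema of the Navteq reference data.

```python
import pandas as pd

def split_unique_addresses(df: pd.DataFrame, seed: int = 42):
    """Sketch of the train/validation/test split described in Section 3.1.1.
    `df` holds one row per unique address description; column names are
    assumptions, not the real reference-data schema."""
    # Testing set: one address per unique (state, postal code) pair.
    test = df.drop_duplicates(subset=["state", "zip"])
    remainder = df.drop(test.index)

    train_rows, val_rows = [], []
    # Up to 9 random addresses per (city, state, postal code) combination:
    # the first two go to training, the last one to validation, when applicable.
    for _, group in remainder.groupby(["city", "state", "zip"]):
        sample = group.sample(n=min(9, len(group)), random_state=seed)
        train_rows.append(sample.iloc[:2])
        if len(sample) >= 3:
            val_rows.append(sample.iloc[[-1]])
    return pd.concat(train_rows), pd.concat(val_rows), test
```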

3.1.2 Address component error injector. To synthesize low-quality geocoding input, we build an address component error injector that randomly generates errors and variations based on human input patterns. To capture such patterns for geocoding input, we first extract three months of geocoding transactions from the Texas A&M geocoding platform and keep only those inputs that did not achieve a full matching score (i.e., the reference data could only partially match the input). In total, we obtained roughly 30 million input queries. Next, we iterate over each input and compare it to its corresponding reference data to detect input errors and variations. Since user input and matched reference data have already been segmented by address component when the geocoding platform seeks a match, input errors can be found by aligning user input address descriptions with their corresponding descriptions in the address reference datasets. For example, if the city name is missing in the input compared to the corresponding reference data, an omission error is detected. While iterating over the historical user input data, we collect sets of mismatched input samples per address component and then further distill them to obtain cases of commonly used abbreviations and common substitutions for each address component. In total, we identify 21 errors and variations on different address components, listed in Table 1.
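Conceptually, the alignment step reduces to a field-by-field comparison between the segmented input and its matched reference record. A minimal sketch follows; the component keys and mismatch categories are illustrative assumptions, as the production logs are richer than this.

```python
COMPONENTS = ["house_number", "predirectional", "street_name",
              "road_type", "postdirectional", "city", "state", "zip"]

def detect_errors(user_input: dict, reference: dict) -> list:
    """Compare a segmented user input against its matched reference record
    and flag the mismatch type per address component (sketch only)."""
    findings = []
    for comp in COMPONENTS:
        u = user_input.get(comp, "")
        r = reference.get(comp, "")
        if r and not u:
            findings.append((comp, "omission"))       # in reference, missing in input
        elif u and not r:
            findings.append((comp, "addition"))       # extra token supplied by the user
        elif u != r:
            findings.append((comp, "substitution/typo"))  # distilled into finer cases downstream
    return findings

# e.g. a missing city name is detected as a city omission:
detect_errors({"street_name": "MAIN", "city": ""},
              {"street_name": "MAIN", "city": "HOUSTON"})
```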

Lastly, we create the logic to generate these identified errors and variations by aligning and comparing the segmented user input with the segmented reference data. Addition or omission errors can be generated by reversing the process by which these errors/variations are detected. For instance, a directional addition error is identified by comparing the user input and the reference data; such an error can thus be synthesized by adding a directional to an address record in the reference data. For typographic errors, we employ the same mechanism as the Freely Extensible Biomedical Record Linkage (FEBRL) [3] to randomly swap, delete, insert, or replace a character. We quantify the degree of a typographic error by edit distance and set the probabilities of typographic errors with edit distance 1 or 2 to be equal. As for errors/variations of abbreviation and substitution, we leverage the collected common cases to reproduce the error/variation, for example, replacing Los Angeles with LA for a city name input.
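A minimal sketch of this FEBRL-style character corruption is given below. It is a simplification: the original FEBRL implementation also models details such as keyboard adjacency, which are omitted here.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def inject_typo(token: str, edits: int = 1) -> str:
    """Randomly swap, delete, insert, or replace characters, mirroring the
    FEBRL-style corruption described above (a simplified sketch)."""
    chars = list(token)
    for _ in range(edits):
        op = random.choice(["swap", "delete", "insert", "replace"])
        if op == "swap" and len(chars) > 1:
            i = random.randrange(len(chars) - 1)
            chars[i], chars[i + 1] = chars[i + 1], chars[i]  # swap adjacent chars
        elif op == "delete" and len(chars) > 1:
            del chars[random.randrange(len(chars))]
        elif op == "insert":
            chars.insert(random.randrange(len(chars) + 1), random.choice(ALPHABET))
        elif op == "replace" and chars:
            chars[random.randrange(len(chars))] = random.choice(ALPHABET)
    return "".join(chars)

# Edit distances 1 and 2 are drawn with equal probability (Section 3.1.2).
corrupted = inject_typo("main", edits=random.choice([1, 2]))
```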

3.1.3 Low-quality geocoding input for benchmarks. The last step is synthesizing the low-quality geocoding input used as the benchmark dataset to assess address parsing techniques. We apply the address component error injector to the training, validation, and test ground-truth datasets obtained in Section 3.1.1. Specifically, we set the probability of injecting errors/variations into an address record in these three split datasets to 0.5, and the ratio of records receiving one versus two errors/variations to 7:3. Every address component has the same chance of being manipulated to contain an error/variation. It's worth noting that we only inject errors that are applicable to an address record. For example, if an address has an ordinal-number street name such as "5th Avenue", the ordinal number suffix omission error is applicable, yielding "5 Avenue". We intentionally reduce the frequency of the mismatched postal code digits error to prevent it from dominating the synthesized errors. To this end, the training, validation, and test datasets all contain address records with zero, one, or two errors/variations, as summarized in Table 2. The distribution of each error/variation within each dataset is summarized in Figure 3. To label these three datasets for training and evaluation, we employ the IOB (Inside-Outside-Beginning) tagging scheme to assign the corresponding label to each chunk segmented by white space. For example, the city name Los Angeles receives two labels: B-CITY and I-CITY.
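As a small illustration of the IOB labeling step, the sketch below assigns B-/I- tags to whitespace-separated chunks; the label names are placeholders standing in for the USPS components of Figure 1.

```python
def iob_tags(segments):
    """Assign IOB labels to whitespace-separated chunks of each address
    component, e.g. the city 'Los Angeles' yields B-CITY and I-CITY."""
    labeled = []
    for label, text in segments:
        for k, chunk in enumerate(text.split()):
            labeled.append((chunk, ("B-" if k == 0 else "I-") + label))
    return labeled

# [('1600', 'B-NUMBER'), ('Los', 'B-CITY'), ('Angeles', 'I-CITY')]
print(iob_tags([("NUMBER", "1600"), ("CITY", "Los Angeles")]))
```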

3.2        Baseline models

The following section provides an overview of the baseline models utilized to build address parsers in our experiments, each representing significant strides in the field of NLP.

Table 1: Geocoding input errors and variations

Address component | Error/Variation | Example
House number | Omission | 1600 Main St → Main St
Pre-/Post-directional | Omission | East Main St → Main St
Pre-/Post-directional | Pre/Post-direction swap | E Main St NW → NW Main St E
Street base name | Typo (edit distance 1) | Main St → Man St
Street base name | Typo (edit distance 2) | Main St → Mian St
Street base name | Number suffix omission | 5th Ave → 5 Ave
Street base name | Spanish prefix omission | La Brea Ave → Brea Ave
Street base name | Space omission | Memory Hill → Memoryhill
Street base name | Space addition | Reachcliff → Reach Cliff
Street base name | Partial abbreviation | Warm Mountain → Warm Mtn
Road type | Omission | Main St → Main
Road type | Valid road type substitution | Main St → Main Ave
Road type | Invalid road type substitution | Main St → Main St St
City | Omission | Houston, TX 77845 → TX 77845
City | Typo (edit distance 1) | Austin → Austiun
City | Typo (edit distance 2) | Luverne → Luvre
City | Direction addition | Houston → South Houston
City | Direction omission | North Little Rock → Little Rock
City | First character abbreviation | Los Angeles → LA
City | Space addition | Redlands → Red Lands
State | Omission | Houston, TX 77001 → Houston, 77001
Postal code | Omission | Houston, TX 77001 → Houston, TX
Postal code | Any digits mismatched | 77845 → 77843

Table 2: Frequency of address records with different quality in the benchmark dataset

Subset | Total | No error/variation | One error/variation | Two errors/variations
Training | 148,173 | 74,086 | 51,898 | 22,189
Validation | 60,522 | 30,230 | 21,247 | 9,045
Test | 30,622 | 15,286 | 10,736 | 4,600

Bidirectional LSTM-CRF [17]: The Bidirectional LSTM-CRF model combines the strengths of both Bidirectional Long Short-Term Memory (Bi-LSTM) and Conditional Random Fields (CRF) for sequence labeling tasks. Bi-LSTM, a type of Recurrent Neural Network (RNN), is capable of capturing context from both directions of a sequence and hence is widely used for NLP tasks. CRF, on the other hand, is a statistical modeling method often used for structured prediction; in the context of NLP, CRFs are used to predict the most likely label sequence for a sequence of words. The Bidirectional LSTM-CRF model leverages the Bi-LSTM to extract complex features from input sequences and then uses the CRF to predict the optimal label sequence, considering both the input sequence and the correlations between labels, resulting in state-of-the-art performance on various sequence labeling tasks.
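As a concrete illustration, here is a minimal PyTorch sketch of this architecture, using the third-party pytorch-crf package for the CRF layer; this is not the paper's actual implementation (which builds on a different open-sourced codebase, see Section 4.1), and the embedding layer would in practice be initialized from GloVe vectors.

```python
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf

class BiLSTMCRF(nn.Module):
    """Sketch of a Bidirectional LSTM-CRF tagger; sizes follow Section 4.1
    (100-d embeddings, LSTM hidden size 200)."""
    def __init__(self, vocab_size, num_tags, embed_dim=100, hidden=200):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)  # GloVe-initialized in practice
        self.lstm = nn.LSTM(embed_dim, hidden // 2,
                            bidirectional=True, batch_first=True)
        self.hidden2tag = nn.Linear(hidden, num_tags)  # per-token emission scores
        self.crf = CRF(num_tags, batch_first=True)     # label-transition modeling

    def _emissions(self, tokens):
        return self.hidden2tag(self.lstm(self.embedding(tokens))[0])

    def loss(self, tokens, tags, mask):
        # Negative log-likelihood of the gold tag sequence under the CRF.
        return -self.crf(self._emissions(tokens), tags, mask=mask)

    def decode(self, tokens, mask):
        # Viterbi decoding of the optimal label sequence.
        return self.crf.decode(self._emissions(tokens), mask=mask)
```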

BERT [5]: Bidirectional Encoder Representations from Transformers (BERT), developed by Google, revolutionized the NLP landscape by introducing a novel pre-training objective known as the Masked Language Model (MLM). This objective allows BERT to understand the context of a word by considering both its preceding and following words, a significant departure from previous models that only captured unidirectional context. Pre-trained on a substantial corpus of unlabeled text, including the entirety of Wikipedia and the BookCorpus [6, 42], BERT has shown remarkable performance across a variety of NLP tasks. We use the standard bert-base-uncased model, which contains 110M parameters.

roBERTa [26]: roBERTa, a variant of BERT introduced by Facebook, further refines the pre-training process. It eliminates the next-sentence pre-training objective, modifies several key hyperparameters, and leverages larger mini-batches and learning rates. Additionally, roBERTa is trained on an augmented version of the BookCorpus dataset [6, 42], leading to improved performance over BERT on several benchmark tasks. We use the standard roberta-base model, which contains 125M parameters.

DistilBERT [33]: As a distilled variant of BERT, DistilBERT represents an effort to optimize the balance between model performance and resource efficiency. DistilBERT is 40% smaller, 60% faster, yet retains 97% of BERT's performance [33]. This is achieved through a process known as distillation, where a smaller model (the student) is trained to mimic the behavior of a larger model (the teacher). We use the standard distilbert-base-uncased model, which contains 67M parameters.
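To illustrate the distillation idea, here is a generic soft-target loss sketch in the style of Hinton et al.; DistilBERT's actual training objective additionally combines a masked language modeling loss and a cosine embedding loss, which are omitted here.

```python
import torch.nn.functional as F

def soft_target_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """Generic knowledge-distillation loss: the student is trained to match
    the teacher's temperature-softened output distribution. A sketch of the
    idea, not DistilBERT's exact objective."""
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale by T^2 to keep gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2
```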

GPT-3 [1]: GPT-3, the successor to GPT-2 and the model family underlying ChatGPT, also developed by OpenAI, is an autoregressive language model with a staggering 175 billion parameters.

Figure 3: The distribution of synthesized geocoding input errors/variations in training (a), validation (b), and test (c) datasets

GPT-3's size and complexity enable it to excel in tasks involving the generation of long, coherent text passages. In addition, GPT-3 exhibits remarkable proficiency in translating between languages, answering questions, summarizing text, and more, making it one of the most versatile language models to date.

3.3        Evaluation metrics

Given that the task of geocoding address parsing is to segment the input address description and assign a corresponding address component label to each segment based on the USPS address standard, we quantify parsing performance by the standard NER evaluation metrics, namely, the precision, recall, and F1 score (i.e., the harmonic mean of precision and recall) of every annotated label. Such a measurement indicates a parsing model's capability to recognize all address components correctly. Since the output of geocoding address parsing is used to build a query string to retrieve and rank matched candidates, it is possible that not all address components are used to build queries, and some address components are more important than others, depending on how the matching component of a geocoding system is built. To this end, we further calculate a score (denoted the parsing score) based on the weight of each address component used by the Texas A&M geocoding platform, using Equation 1:

$$\text{Parsing Score} = \sum_{c \in C} W_c \times F1_c \qquad (1)$$

where $C$ is the set of USPS address components, and $W_c$ and $F1_c$ denote the weight and F1 score of address component $c$, respectively. The weight of each address component (shown in Table 3) is obtained from the Texas A&M geocoding platform, given its performance in [10, 12].

Table 3: Weight of each address component

Address Component | Weight
House number | 20
Predirectional | 7
Street base name | 45
Road type | 10
Postdirectional | 4
City | 17
State | 1
Zip code | 45
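For reference, the sketch below computes the parsing score of Equation 1 from these weights; with a perfect F1 of 1.0 on every component, the maximum attainable score is the sum of the weights, 149.

```python
# Weights from Table 3; per-component F1 scores come from the NER evaluation.
WEIGHTS = {
    "House number": 20, "Predirectional": 7, "Street base name": 45,
    "Road type": 10, "Postdirectional": 4, "City": 17, "State": 1,
    "Zip code": 45,
}

def parsing_score(f1: dict) -> float:
    """Equation 1: the weighted sum of per-component F1 scores."""
    return sum(w * f1[component] for component, w in WEIGHTS.items())

# A perfect parser scores sum(WEIGHTS.values()) == 149.
assert parsing_score({c: 1.0 for c in WEIGHTS}) == sum(WEIGHTS.values())
```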

4        EXPERIMENT RESULTS AND DISCUSSION

4.1        Model Implementation

Since the GPT-3 model generates output via a user prompt, we conducted the NER task via the promptify library. This library sends a structured input to LLMs, which is equivalent to asking a properly structured question that helps the GPT-3 model understand the task better. The API version we used is gpt-3.5-turbo. We supplied three examples to the GPT-3 model to help it understand the expectations for the output, as we found the output under the zero-shot scenario to be suboptimal. The three examples listed below were randomly selected from the training dataset, covering pre-directional, post-directional, and no-directional cases; a minimal prompt sketch follows the examples.

(1) 467 W BROOKWOOD CIR OZARK AL 36360

(2) 27195 DORY RD W SALVO NC 27972

(3) 118 LUKE HICKS RD HAZEL GREEN AL 35750
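As a minimal sketch of an equivalent few-shot call, the snippet below drives gpt-3.5-turbo through the OpenAI chat API directly rather than through promptify (whose prompt template differs); the JSON output schema and component keys here are illustrative assumptions, not the paper's exact prompt.

```python
from openai import OpenAI  # sketch only; the paper uses the promptify library

client = OpenAI()

# Few-shot examples from the training set; the output schema is an assumption.
FEW_SHOT = """Extract the USPS address components from the input as JSON.
Input: 467 W BROOKWOOD CIR OZARK AL 36360
Output: {"number": "467", "predirectional": "W", "name": "BROOKWOOD", "road_type": "CIR", "city": "OZARK", "state": "AL", "zip": "36360"}
Input: 27195 DORY RD W SALVO NC 27972
Output: {"number": "27195", "name": "DORY", "road_type": "RD", "postdirectional": "W", "city": "SALVO", "state": "NC", "zip": "27972"}
Input: 118 LUKE HICKS RD HAZEL GREEN AL 35750
Output: {"number": "118", "name": "LUKE HICKS", "road_type": "RD", "city": "HAZEL GREEN", "state": "AL", "zip": "35750"}
"""

def parse_address(address: str) -> str:
    prompt = FEW_SHOT + f"Input: {address}\nOutput:"
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # near-deterministic decoding for a tagging task
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```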

The three transformer-based models, along with the Bidirectional LSTM-CRF model, were implemented using the PyTorch framework. The transformer-based models were built with the Hugging Face library. We trained these models using the Adam optimizer [20], a popular choice for training deep learning models due to its efficiency and low memory requirements. The initial learning rate was set to 0.00002, with a linear learning rate schedule. The Adam optimizer was configured with beta1 and beta2 parameters set to 0.9 and 0.999, respectively. The dropout was set to 0.5, as we observed that the default dropout could easily lead to over-fitting in the initial stage of this experiment. The batch size was set to 30. The Bidirectional LSTM-CRF model was implemented based on an open-sourced implementation. Specifically, we employed GloVe.6B.100d word embeddings fed into the neural network, the stochastic gradient descent optimizer with a learning rate of 0.1, an LSTM cell hidden size of 200, and a batch size of 10. We added an IOB label constraint on the transition parameters to enforce valid transitions.
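A sketch of this transformer fine-tuning setup with the Hugging Face Trainer is given below, under the hyperparameters above and the 25-epoch budget of Section 4.2. The dataset objects and the label count (8 components x B/I tags + O = 17) are assumptions for illustration.

```python
from transformers import (AutoConfig, AutoModelForTokenClassification,
                          AutoTokenizer, Trainer, TrainingArguments)

def build_trainer(train_ds, val_ds, model_name="bert-base-uncased", num_labels=17):
    """Sketch of the fine-tuning setup in Section 4.1. `train_ds`/`val_ds`
    are assumed to be tokenized token-classification datasets with IOB labels."""
    config = AutoConfig.from_pretrained(
        model_name, num_labels=num_labels,
        hidden_dropout_prob=0.5,  # dropout 0.5 per Section 4.1 (BERT-style configs)
    )
    model = AutoModelForTokenClassification.from_pretrained(model_name, config=config)
    args = TrainingArguments(
        output_dir="address-parser",
        learning_rate=2e-5,                # initial learning rate
        lr_scheduler_type="linear",
        adam_beta1=0.9, adam_beta2=0.999,  # Adam settings
        per_device_train_batch_size=30,
        num_train_epochs=25,
        evaluation_strategy="epoch",       # record validation loss each epoch
    )
    return Trainer(model=model, args=args, train_dataset=train_ds,
                   eval_dataset=val_ds,
                   tokenizer=AutoTokenizer.from_pretrained(model_name))
```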

4.2        Experiment Settings

This experiment aims to compare the performance of the different baseline models on the task of geocoding address parsing. To ensure a fair comparison, we utilized the same datasets, run-time environment, and training/evaluation procedures so that any differences in performance could be attributed to the models' architectures and capabilities rather than external factors. Among the five baseline models, four (i.e., the Bidirectional LSTM-CRF model and the three transformer-based models) require a training process, whereas the GPT-3 model does not, as we directly leveraged the gpt-3.5-turbo API to conduct NER inference for address parsing. Thus, we first trained and evaluated the four trainable baselines using the training and validation datasets to obtain their trained models; we then applied these trained models and the GPT-3 model to the test dataset to compare their performance. We set the number of training epochs to 25, as preliminary experiments indicated the evaluation loss dropped below 0.001 by that point. Each model was evaluated on the validation dataset at the end of each epoch, and its evaluation loss was recorded. This allowed us to monitor the models' learning progress and adjust the training parameters if necessary. All training and evaluation processes were conducted on Google Colaboratory with a Tesla V100 GPU.

4.3        Results and discussion

Figure 4 presents the trajectories of training and validation loss for the baseline models throughout the experiments. The roBERTa model's validation loss is initially high but decreases rapidly as training progresses. In contrast, the other models exhibit a steady validation loss throughout the entire process. Most models reach convergence around the 20-epoch mark. Notably, the DistilBERT model stands out for its faster convergence rate compared to the other models. Having trained these four models, we then tested them alongside the GPT-3 model using the test dataset detailed in Section 3.1. The evaluation results of the five baseline models are presented in Table 4, illustrating their effectiveness in recognizing and extracting individual address elements as well as their overall performance.

Across all address components, the Bidirectional LSTM-CRF model consistently demonstrates superior or comparable performance relative to the other models. For instance, in identifying the house number, this model achieved the highest F1 score of 0.99977, marginally surpassing roBERTa (0.99976) and BERT (0.99963). Its superiority is also evident in parsing the state and postal code components, where it yielded an F1 score of 0.99993 and a perfect score of 1.00000, respectively. The BERT model exhibits robust performance across all tasks, closely trailing the Bidirectional LSTM-CRF model; it performed particularly well in identifying the house number and postal code, with F1 scores of 0.99963 and 1.00000, respectively. Notably, the roBERTa model, while generally performing well, exhibited a slight drop in performance when parsing the postdirectional component, with an F1 score of 0.94003. The Bidirectional LSTM-CRF model also has the highest parsing score, indicating that it not only performs well in parsing each address component but also excels in parsing the components that carry the most weight in the geocoding process. The DistilBERT model's performance was consistently high across all tasks, with its lowest F1 score being 0.96771 for the postdirectional component; its performance was particularly strong in parsing the house number and postal code, achieving F1 scores of 0.99970 and 1.00000, respectively. The GPT-3 model, however, displayed a relatively lower performance compared to the other models. While it performed reasonably well in parsing the house number, state, and postal code, with F1 scores of 0.98810, 0.97505, and 0.97851, respectively, it struggled significantly with the postdirectional component, achieving an F1 score of 0.42917, which is markedly lower than the scores of the other models.

The Bidirectional LSTM-CRF model consistently outperforms or matches the other models across all address components. This superior performance can be attributed to the inherent strengths of the model: it combines the advantages of both bidirectional LSTM and conditional random fields, allowing it to capture context from both past and future input while the CRF makes the most of sentence-level tag information, making it a powerful model for sequence labeling tasks such as NER. The BERT model and its variants, while performing robustly across all tasks, fall slightly behind the Bidirectional LSTM-CRF model. This could be because BERT, although a powerful model, is pre-trained on masked language modeling and next-sentence prediction tasks, which may not be perfectly aligned with the NER task in address parsing. On the other hand, since these pre-trained models have been trained on large amounts of data, their performance could potentially be improved with hyperparameter tuning to optimize them for the specific task of address parsing; this could involve adjusting parameters such as the learning rate and batch size, adding or removing layers, or changing the number of hidden units, among other things. As a generative model, GPT-3 demonstrates lower performance than the others. One potential reason is that GPT-3 generates output based on the context provided by the prompt, so the way the prompt is framed can significantly affect the model's performance. Another reason could be that the selected few-shot learning examples used by the GPT-3 model were completely error-free. However, its strong performance in identifying house numbers and postal codes suggests that it is still a valuable tool for these tasks. It would be interesting to compare the impact of different few-shot learning examples on the GPT-3 model's performance.

Figure 4: The training and validation loss of baseline models

It's worth noting that the hyperparameters of the evaluated models come from common settings used by other studies, as the main scope of this paper is to provide a solid foundation to facilitate future model evaluations. Fine-tuning the hyperparameters of each model to find their best performance is one direction for future work.

5        CONCLUSION AND FUTURE WORK

In this work, we introduce a benchmark consisting of benchmark datasets and evaluation metrics to assess the performance of the GPT-3 model in geocoding address parsing and compare it with three transformer-based models and one LSTM-based model. We create a benchmark dataset capturing 21 input errors/variations observed in real user input logs; this dataset also covers the unique address formatting across the U.S. (i.e., 50 states and D.C.). This helps address the demand for a 'gold standard' evaluation dataset in geocoding and further ensures that evaluation results closely reflect performance in real-world scenarios. Our findings reveal that the Bidirectional LSTM-CRF model slightly outperforms the transformer-based models. Though the GPT-3 model's performance lags behind the other evaluated models, it shows encouraging results in address parsing using few-shot examples, suggesting room for improvement with additional fine-tuning. We intend this work to serve as a solid baseline for future development and experimental comparisons in similar geographic information retrieval tasks.

Future work includes (1) enhancing the evaluation benchmark dataset by capturing more input errors/variations; (2) fine-tuning the models to attempt to achieve state-of-the-art performance on the given input dataset and comparing them to traditional models (e.g., CRF and HMM models); and (3) extending this benchmark to evaluate address parsing or geocoding systems in other countries, given the heterogeneity of languages and address systems across countries.

REFERENCES

[1] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901.
[2] Peter Christen, Daniel Belacic, et al. 2005. Automated probabilistic address standardisation and verification. In Australasian Data Mining Conference. Citeseer.
[3] Peter Christen, Tim Churches, et al. 2002. Febrl – Freely extensible biomedical record linkage. (2002).
[4] Helen Craig, Dragomir Yankov, Renzhong Wang, Pavel Berkhin, and Wei Wu. 2019. Scaling Address Parsing Sequence Models through Active Learning. In Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. 424–427.
[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[6] Wikimedia Foundation. [n. d.]. Wikimedia Downloads. https://dumps.wikimedia.org
[7] Yingqiang Ge, Wenyue Hua, Jianchao Ji, Juntao Tan, Shuyuan Xu, and Yongfeng Zhang. 2023. OpenAGI: When LLM meets domain experts. arXiv preprint arXiv:2304.04370 (2023).
[8] Daniel Goldberg. 2013. Geocoding Techniques and Technologies for Location-Based Services. In Advanced Location-Based Technologies and Services. CRC Press: Boca Raton, FL, 75–106.
[9] Daniel W Goldberg. 2008. A geocoding best practices guide. (2008).
[10] Daniel W. Goldberg. 2011. Improving Geocoding Match Rates with Spatially-Varying Block Metrics. Trans. GIS 15 (2011), 829–850.
[11] Daniel W Goldberg. 2013. Geocoding techniques and technologies for location-based services. Advanced Location-Based Technologies and Services (2013), 75–106.
[12] Daniel W Goldberg and Myles G Cockburn. 2010. Improving geocode accuracy with candidate selection criteria. Transactions in GIS 14, s1 (2010), 149–176.
[13] Daniel W Goldberg, John P Wilson, and Craig A Knoblock. 2007. From text to geographic coordinates: the current state of geocoding. URISA Journal 19, 1 (2007), 33–46.
[14] Yassine Guermazi, Sana Sellami, and Omar Boucelma. 2022. A RoBERTa based approach for address validation. In European Conference on Advances in Databases and Information Systems. Springer, 157–166.
[15] Berkay Güler, Betül Aygün, Aydın Gerek, and Alaeddin Selçuk Gürel. 2023. Deep Active Learning for Address Parsing Tasks with BERT. In 2023 31st Signal Processing and Communications Applications Conference (SIU). IEEE, 1–4.
[16] Yingjie Hu, Krzysztof Janowicz, and Sathya Prasad. 2014. Improving Wikipedia-based place name disambiguation in short texts using structured data from DBpedia. In Proceedings of the 8th Workshop on Geographic Information Retrieval. 1–8.
[17] Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991 (2015).
[18] Matthew John Hutchinson. 2010. Developing an agent-based framework for intelligent geocoding. Ph.D. Dissertation. Curtin University.
Table 4: Evaluation results of baseline models

Address component | BiLSTM-CRF | BERT | roBERTa | DistilBERT | GPT-3
House number | 0.99977 | 0.99963 | 0.99976 | 0.99970 | 0.98810
Predirectional | 0.99719 | 0.99378 | 0.98996 | 0.99579 | 0.70077
Street base name | 0.99241 | 0.98963 | 0.98022 | 0.99061 | 0.83853
Road type | 0.99705 | 0.99345 | 0.98543 | 0.99427 | 0.88328
Postdirectional | 0.96739 | 0.96680 | 0.94003 | 0.96771 | 0.42917
City | 0.99399 | 0.99293 | 0.98539 | 0.99341 | 0.90404
State | 0.99993 | 0.99986 | 0.99894 | 0.99991 | 0.97505
Postal code | 1.00000 | 1.00000 | 0.99987 | 1.00000 | 0.97851
Overall F1 | 0.99677 | 0.99545 | 0.99084 | 0.99590 | 0.90875
Parsing Score | 148.37200 | 148.16378 | 147.39396 | 148.24340 | 133.32740
[19] Geoffrey M Jacquez. 2012. A research agenda: does geocoding positional error matter in health GIS studies? Spatial and Spatio-temporal Epidemiology 3, 1 (2012), 7–16.
[20] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[21] Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360 (2016).
[22] Diya Li, Harshita Chaudhary, and Zhe Zhang. 2020. Modeling spatiotemporal pattern of depressive symptoms caused by COVID-19 using social media data mining. International Journal of Environmental Research and Public Health 17, 14 (2020), 4988.
[23] Hao Li, Wei Lu, Pengjun Xie, and Linlin Li. 2019. Neural Chinese address parsing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 3421–3431.
[24] Lin Li, Wei Wang, Biao He, and Yu Zhang. 2018. A hybrid method for Chinese address segmentation. International Journal of Geographical Information Science 32, 1 (2018), 30–48.
[25] Cheng-Lin Liu, Masashi Koga, and Hiromichi Fujisawa. 2002. Lexicon-driven segmentation and recognition of handwritten character strings for Japanese address reading. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 11 (2002), 1425–1437.
[26] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
[27] Gengchen Mai, Chris Cundy, Kristy Choi, Yingjie Hu, Ni Lao, and Stefano Ermon. 2022. Towards a foundation model for geospatial artificial intelligence (vision paper). In Proceedings of the 30th International Conference on Advances in Geographic Information Systems. 1–4.
[28] Shreyas Mangalgi, Lakshya Kumar, and Ravindra Babu Tallamraju. 2020. Deep contextual embeddings for address classification in e-commerce. arXiv preprint arXiv:2007.03020 (2020).
[29] Yolanda J. McDonald, Daniel W. Goldberg, Isabel C. Scarinci, Philip E. Castle, Jack Cuzick, Michael Robertson, and Cosette M. Wheeler. 2017. Health Service Accessibility and Risk in Cervical Cancer Prevention: Comparing Rural Versus Nonrural Residence in New Mexico. The Journal of Rural Health 33, 4 (2017), 382–392.
[30] Yolanda J McDonald, Daniel W Goldberg, Isabel C Scarinci, Philip E Castle, Jack Cuzick, Michael Robertson, and Cosette M Wheeler. 2017. Health service accessibility and risk in cervical cancer prevention: comparing rural versus nonrural residence in New Mexico. The Journal of Rural Health 33, 4 (2017), 382–392.
[31] Shekoofeh Mokhtari, Ahmad Mahmoody, Dragomir Yankov, and Ning Xie. 2019. Tagging Address Queries in Maps Search. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 9547–9551.
[32] Daniela Nuvolone, Roberto della Maggiore, Sara Maio, Roberto Fresco, Sandra Baldacci, Laura Carrozzi, Francesco Pistelli, and Giovanni Viegi. 2011. Geographical information system and environmental epidemiology: a cross-sectional spatial analysis of the effects of traffic-related air pollution on population respiratory health. Environmental Health 10, 1 (2011), 12.
[33] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019).
[34] Wenqiao Sun. 2017. Chinese named entity recognition using modified conditional random field on postal address. In 2017 10th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI). IEEE, 1–6.
[35] Tin Vu, Solluna Liu, Renzhong Wang, and Kumarswamy Valegerepura. 2020. Noise Prediction for Geocoding Queries using Word Geospatial Embedding and Bidirectional LSTM. In Proceedings of the 28th International Conference on Advances in Geographic Information Systems. 127–130.
[36] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. 2022. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682 (2022).
[37] Marouane Yassine, David Beauchemin, François Laviolette, and Luc Lamontagne. 2021. Leveraging Subword Embeddings for Multinational Address Parsing. In 2020 6th IEEE Congress on Information Science and Technology (CiSt). IEEE, 353–360.
[38] Zhengcong Yin, Daniel W Goldberg, Tracy A Hammond, Chong Zhang, Andong Ma, and Xiao Li. 2020. A probabilistic framework for improving reverse geocoding output. Transactions in GIS 24, 3 (2020), 656–680.
[39] Zhengcong Yin, Andong Ma, and Daniel W Goldberg. 2019. A deep learning approach for rooftop geocoding. Transactions in GIS 23, 3 (2019), 495–514.
[40] Paul A Zandbergen and Joseph W Green. 2007. Error and bias in determining exposure potential of children at school locations using proximity-based GIS techniques. Environmental Health Perspectives 115, 9 (2007), 1363.
[41] Zhe Zhang, Zhangyang Wang, Angela Li, Xinyue Ye, E Lynn Usery, and Diya Li. 2021. An AI-based Spatial Knowledge Graph for Enhancing Spatial Data and Knowledge Search and Discovery. In Proceedings of the 1st ACM SIGSPATIAL International Workshop on Searching and Mining Large Collections of Geospatial Data. 13–17.
[42] Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books. In The IEEE International Conference on Computer Vision (ICCV).
[43] Kate Zinszer, Christian Jauvin, Aman Verma, Lucie Bedard, Robert Allard, Kevin Schwartzman, Luc de Montigny, Katia Charland, and David L Buckeridge. 2010. Residential address errors in public health surveillance data: A description and analysis of the impact on geocoding. Spatial and Spatio-temporal Epidemiology 1, 2-3 (2010), 163–168.