{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# __Demonstrating the utility of machine learning innovations in address matching to spatial socio-economic applications__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Abstract" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The last decade has seen an unprecedented rise in the number, frequency and availability of data sources. Yet these sources are often incomplete, meaning data fusion is required to enhance their quality and scope. In the context of spatial analysis, address matching is critical to enriching household socio-economic and demographic characteristics. Matching administrative, commercial, or lifestyle data sources to items such as household surveys can improve data quality, enable spatial data visualisation, and lower respondent burden in household surveys. Typically, when a practitioner has high quality data, unique identifiers are used to facilitate a direct linkage between household addresses. However, real-world databases often lack the unique identifiers needed for a one-to-one match. Moreover, irregularities between the text representations of potential matches mean extensive cleaning of the data is often required as a pre-processing step. For this reason, practitioners have traditionally relied on two families of linkage techniques for matching the text representations of addresses: deterministic and mathematical approaches. Deterministic matching consists of constructing hand-crafted rules that classify address matches and non-matches based on specialist domain knowledge, while mathematical approaches have increasingly adopted machine learning techniques for resolving pairs of addresses to a match. In this notebook we focus on the latter, demonstrating the utility of machine learning approaches to the address matching work flow. 
To achieve this, we construct a predictive model that resolves matches between two small datasets of restaurant addresses in the US. While the problem case may seem trivial, the intention of the notebook is to demonstrate an approach that is reproducible and extensible to larger data challenges. Thus, in the present notebook, we document an end-to-end pipeline that is replicable and instructive towards assisting future address matching problem cases faced by the regional scientist." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Road map\n", "\n", "1. [Packages and dependencies](#package_dependencies)\n", "2. [Data loading, cleaning and segmentation](#section_standardisation)\n", " 1. [Segmentation of address string into field columns](#section_segmentation)\n", "3. [Creation of candidate address pairs using a full index](#section_fullindex)\n", " 1. [Creation of comparison vectors from indexed addresses](#section_comp_vecs_full)\n", " 2. [Classification and evaluation of match performance](#section_classification_fullindex)\n", "4. [Creation of candidate address pairs by blocking on zipcode](#section_blocking)\n", " 1. [Creation of synthetic non-matched addresses](#section_synthetic_nonmatches)\n", " 2. [Blocking on postcode attribute](#section_blocking_postcode_attribute)\n", " 3. [Classification and evaluation of match performance](#section_evaluation)\n", "5. [Conclusion](#section_conclusion)\n", "6. [Bibliography](#section_bibliography)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our overarching objective is to demonstrate how machine learning can be integrated into the address matching work flow. By definition, address matching pertains to the process of resolving pairs of records with a spatial footprint. While geospatial matching links the geometric representations of spatial objects, address matching typically involves linking the text-based representations of address pairs. 
The utility of address matching, and record linkage in general, lies in the ability to unlock attributes from sources of data that cannot be linked by traditional means. This is often because the datasets lack a common key on which to join the addresses of a premise. Two example applications of address matching are the linkage of historical censuses across time for exploring economic and geographic mobility across multiple generations (Ruggles et al., 2018), and exploring how early-life hazardous environmental exposure, socio-economic conditions, or natural disasters impact the health and economic outcomes of individuals living in particular residential locations (Cayo & Talbot, 2003; Reynolds et al., 2003; Baldovin et al., 2015).\n", "\n", "For demonstrative purposes, we rely on a small set of addresses from the Fodors and Zagat restaurant guides that contains 112 matched addresses for training a predictive model that resolves address pairs to matches and non-matches. In a real-world application, a machine learning model trained on a small sample of matched addresses could be used to resolve matches between the remaining addresses of a larger dataset. While we use the example of restaurant addresses, these could easily be replaced by addresses from a far less trivial source and the work flow required to implement the address matching exercise would remain the same. Therefore, it is the intention of this guide to provide insight into how a supervised address matching work flow proceeds, and to inspire interested users to scale the supplied code to larger and more interesting problems." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Packages and dependencies" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import os\n", "import uuid\n", "import warnings\n", "from IPython.display import HTML\n", "\n", "# load external libraries\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import jellyfish\n", "import recordlinkage as rl\n", "import seaborn as sns\n", "from postal.parser import parse_address # CRF parser\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.model_selection import cross_validate, train_test_split\n", "from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix\n", "\n", "# configure some settings\n", "np.random.seed(123)\n", "sns.set_style('whitegrid')\n", "# None disables column truncation (recent pandas releases no longer accept -1)\n", "pd.set_option('display.max_colwidth', None)\n", "warnings.simplefilter(action='ignore', category=FutureWarning)\n", "\n", "def hover(hover_color=\"#add8e6\"):\n", "    return dict(selector=\"tbody tr:hover\",\n", "                props=[(\"background-color\", \"%s\" % hover_color)])\n", "\n", "# table CSS\n", "styles = [\n", "    # table properties\n", "    dict(selector=\" \", \n", "         props=[(\"margin\",\"0\"),\n", "                (\"font-family\",'\"Helvetica\", \"Arial\", sans-serif'),\n", "                (\"border-collapse\", \"collapse\"),\n", "                (\"border\",\"none\"), (\"border-style\", \"hidden\")]),\n", "    dict(selector=\"td\", props = [(\"border-style\", \"hidden\"), \n", "                                 (\"border-collapse\", \"collapse\")]),\n", "\n", "    # header color \n", "    dict(selector=\"thead\", \n", "         props=[(\"background-color\",\"#a4dbc8\")]),\n", "\n", "    # background shading\n", "    dict(selector=\"tbody tr:nth-child(even)\",\n", "         props=[(\"background-color\", \"#fff\")]),\n", "    dict(selector=\"tbody tr:nth-child(odd)\",\n", "         props=[(\"background-color\", \"#eee\")]),\n", "\n", "    # header cell properties\n", "    dict(selector=\"th\", \n", "         props=[(\"text-align\", 
\"center\"),\n", " (\"border-style\", \"hidden\"), \n", " (\"border-collapse\", \"collapse\")]),\n", "\n", " hover()\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data loading, cleaning and segmentation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To begin our exercise we specify the file location that contains the entirety of the 112 Zagat and Fodor matched address pairs. This file can be downloaded from the dedicated Github repository that accompanies the paper (https://github.com/SamComber/address_matching_workflow) using the `wget` command." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2019-12-21 09:11:31-- https://raw.githubusercontent.com/SamComber/address_matching_workflow/master/zagat_fodor_matched.txt\n", "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 199.232.56.133\n", "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|199.232.56.133|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 19939 (19K) [text/plain]\n", "Saving to: ‘zagat_fodor_matched.txt’\n", "\n", "zagat_fodor_matched 100%[===================>] 19.47K --.-KB/s in 0.03s \n", "\n", "2019-12-21 09:11:32 (670 KB/s) - ‘zagat_fodor_matched.txt’ saved [19939/19939]\n", "\n" ] } ], "source": [ "! wget https://raw.githubusercontent.com/SamComber/address_matching_workflow/master/zagat_fodor_matched.txt" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "f = 'zagat_fodor_matched.txt'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Address matching is principally a data quality challenge. Similar to other areas of data analysis, when the quality of input data to the match classification is low, the output generated will typically be of low accuracy (Christen, 2012). 
Problematically, most address databases we encounter in the real world are inconsistent, have missing values, and lack standardisation. Thus, a first step in the address matching work flow is to increase the quality of input data. In this way we increase the accuracy, completeness and consistency of our address records, which increases the ease with which they can be linked by the techniques we apply later on. Typically this stage begins by parsing the text representations of addresses into rows of a dataframe." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# load matched addresses, remove comment lines and reshape into two columns\n", "data = pd.read_csv(f, comment='#', \n", "                   header=None, \n", "                   names=['address']).values.reshape(-1, 2)\n", "\n", "matched_address = pd.DataFrame(data, columns=['addr_zagat', 'addr_fodor'])" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "112 matched addresses loaded.\n" ] }, { "data": { "text/html": [ "
<table>
<thead>
<tr><th></th><th>addr_zagat</th><th>addr_fodor</th></tr>
</thead>
<tbody>
<tr><th>0</th><td>Arnie Morton's of Chicago 435 S. La Cienega Blvd. Los Angeles 90048 310-246-1501 Steakhouses</td><td>Arnie Morton's of Chicago 435 S. La Cienega Blvd. Los Angeles 90048 310/246-1501 American</td></tr>
<tr><th>1</th><td>Art's Deli 12224 Ventura Blvd. Studio City 91604 818-762-1221 Delis</td><td>Art's Delicatessen 12224 Ventura Blvd. Studio City 91604 818/762-1221 American</td></tr>
<tr><th>2</th><td>Bel-Air Hotel 701 Stone Canyon Rd. Bel Air 90077 310-472-1211 Californian</td><td>Hotel Bel-Air 701 Stone Canyon Rd. Bel Air 90077 310/472-1211 Californian</td></tr>
<tr><th>3</th><td>Cafe Bizou 14016 Ventura Blvd. Sherman Oaks 91423 818-788-3536 French Bistro</td><td>Cafe Bizou 14016 Ventura Blvd. Sherman Oaks 91423 818/788-3536 French</td></tr>
<tr><th>4</th><td>Campanile 624 S. La Brea Ave. Los Angeles 90036 213-938-1447 Californian</td><td>Campanile 624 S. La Brea Ave. Los Angeles 90036 213/938-1447 American</td></tr>
<tr><th>5</th><td>Chinois on Main 2709 Main St. Santa Monica 90405 310-392-9025 Pacific New Wave</td><td>Chinois on Main 2709 Main St. Santa Monica 90405 310/392-9025 French</td></tr>
<tr><th>6</th><td>Citrus 6703 Melrose Ave. Los Angeles 90038 213-857-0034 Californian</td><td>Citrus 6703 Melrose Ave. Los Angeles 90038 213/857-0034 Californian</td></tr>
<tr><th>7</th><td>Fenix at the Argyle 8358 Sunset Blvd. W. Hollywood 90069 213-848-6677 French (New)</td><td>Fenix 8358 Sunset Blvd. West Hollywood 90069 213/848-6677 American</td></tr>
<tr><th>8</th><td>Granita 23725 W. Malibu Rd. Malibu 90265 310-456-0488 Californian</td><td>Granita 23725 W. Malibu Rd. Malibu 90265 310/456-0488 Californian</td></tr>
<tr><th>9</th><td>Grill The 9560 Dayton Way Beverly Hills 90210 310-276-0615 American (Traditional)</td><td>Grill on the Alley 9560 Dayton Way Los Angeles 90210 310/276-0615 American</td></tr>
</tbody>
</table>
" ], "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print('{} matched addresses loaded.'.format(matched_address.shape[0]))\n", "matched_address.head(10).style.set_table_styles(styles)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A series of data cleaning exercises will then modify the data in ways that support the application of the linkage techniques. This might involve writing data cleaning scripts that convert all letters to lowercase characters, delete leading and trailing whitespace, remove unwanted characters and tokens such as punctuation, or use hard-coded look-up tables to find and replace particular tokens. Taken together, these steps bring the two address databases the user is attempting to match into a common standard form. This is important because the two sources of address data under consideration will typically follow different naming conventions.\n", "\n", "In the following cell blocks, we execute these steps by standardising our addresses. More specifically, we remove non-address components, convert all text to lower case and remove punctuation and non-alphanumeric characters." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# our rows contain non-address components such as phone number and \n", "# restaurant type, so let's parse these into new columns using regular expressions\n", "zagat_pattern = r\"(?P<address>.*?)(?P<phone_number>\\b\\d{3}\\-\\d{3}\\-\\d{4}\\b)(?P<category>.*$)\"\n", "fodor_pattern = r\"(?P<address>.*?)(?P<phone_number>\\b\\d{3}\\/\\d{3}\\-\\d{4}\\b)(?P<category>.*$)\"\n", "\n", "matched_address[[\"addr_zagat\", \"phone_number_zagat\", \"category_zagat\"]] = matched_address[\"addr_zagat\"].str.extract(zagat_pattern)\n", "matched_address[[\"addr_fodor\", \"phone_number_fodor\", \"category_fodor\"]] = matched_address[\"addr_fodor\"].str.extract(fodor_pattern)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# standardise dataframe by converting all strings to lower case\n", "matched_address = matched_address.applymap(lambda value : value.lower() if isinstance(value, str) else value)\n", "\n", "# remove punctuation and non-alphanumeric characters\n", "# (regex=True is required from pandas 2.0, where str.replace treats the pattern literally by default)\n", "matched_address['addr_zagat'] = matched_address['addr_zagat'].str.replace(r'[^\\w\\s]', '', regex=True)\n", "matched_address['addr_fodor'] = matched_address['addr_fodor'].str.replace(r'[^\\w\\s]', '', regex=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Segmentation of address string into field columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After removing unwanted characters and tokens, our next step is to segment the entire address string into tagged attribute values. Addresses rarely come neatly formatted into sensible fields that identify each component, and so segmentation is a vital and often overlooked stage of the work flow. For example, an address might come in an unsegmented format such as \"19 Water St. New York 11201\". Our objective is then to segment (or label) this address into the appropriate columns for street number, street name, city and postcode. When we segment both sets of addresses from the datasets we intend to link, we build well-defined output fields that are suitable for matching. \n", "\n", "In our case we use a statistical segmentation tool called __Libpostal__, which is a Conditional Random Field (CRF) model trained on OpenStreetMap addresses. 
Before using the Python bindings, users must first install the Libpostal C library (see https://github.com/openvenues/pypostal#installation for installation instructions). CRFs are popular methods in natural language processing (NLP) for predicting sequences of labels across sequences of text inputs. Unlike discrete classifiers, CRFs model the probability of a transition between labels on \"neighbouring\" elements, meaning they take both past and future address field states into account when labelling the tokens of an address. This mitigates a limitation of locally normalised sequence models such as maximum entropy Markov models (MEMMs) called the _label bias problem_: \"transitions leaving a given state to compete only against each other, rather than against all transitions in the model\" (Lafferty et al., 2001). Take, for example, the business address for \"1st for Toys, 244 Ponce de Leon Ave. Atlanta 30308\". A naive segmentation model would incorrectly parse \"1st\" as a property number, whereas it actually begins the business name \"1st for Toys\", leading to an erroneous sequence of label predictions. When a CRF has parsed \"1st\" and reaches the second token, \"for\", the model scores an $l\\times l$ matrix $L$, where $l$ is the number of labels (or address fields) that can be assigned by the CRF. The entry $L_{ij}$ reflects the probability of the current word being labelled as $i$ and the previous word labelled $j$ (Diesner & Carley, 2008). When the parser reaches the _actual_ property number, \"244\", high scores in the matrix indicate that the current label should be a property number, and the previous label revised to a business name. For a more detailed account, see Comber and Arribas-Bel (2019).\n", "\n", "To segment each address, we apply the `parse_address` function row-wise for both the Zagat and Fodors addresses. 
This generates a list of tuples (see below code block for an example of the first two addresses from the Zagat dataset) that we convert into dictionaries before finally reading these into a `pandas` dataframe." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[[('arnie mortons of chicago', 'house'),\n", " ('435', 'house_number'),\n", " ('s la cienega blvd', 'road'),\n", " ('los angeles', 'city'),\n", " ('90048', 'postcode')],\n", " [('arts deli', 'house'),\n", " ('12224', 'house_number'),\n", " ('ventura blvd', 'road'),\n", " ('studio city', 'city'),\n", " ('91604', 'postcode')]]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[[('arnie mortons of chicago', 'house'),\n", " ('435', 'house_number'),\n", " ('s la cienega blvd', 'road'),\n", " ('los angeles', 'city'),\n", " ('90048', 'postcode')],\n", " [('arts deli', 'house'),\n", " ('12224', 'house_number'),\n", " ('ventura blvd', 'road'),\n", " ('studio city', 'city'),\n", " ('91604', 'postcode')]]" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# parse address string using libpostal CRF segmentation tool\n", "addr_zagat_parse = [parse_address(addr, country='us') for addr in matched_address.addr_zagat]\n", "addr_fodor_parse = [parse_address(addr, country='us') for addr in matched_address.addr_fodor]\n", "\n", "# convert to pandas dataframe\n", "addr_zagat_parse = pd.DataFrame.from_records([{k: v for v, k in row} for row in addr_zagat_parse]).add_suffix('_zagat')\n", "addr_fodor_parse = pd.DataFrame.from_records([{k: v for v, k in row} for row in addr_fodor_parse]).add_suffix('_fodor')\n", "\n", "# vertical join of CRF-parsed addresses between both dataframes\n", "matched_address = matched_address.join(addr_zagat_parse).join(addr_fodor_parse)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Given we know the match status of our training data, we can 
safely join the records back together once we have successfully segmented them. Moreover, as we know the match status in advance, we can assign unique IDs that we will use later to create a binary variable indicating whether an address pair is matched or non-matched." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# create one unique ID per matched pair; these will be used later to create a match status\n", "uids = [str(uuid.uuid4()) for _ in range(len(matched_address))]\n", "\n", "# assign the same uid to both parsed dataframes, so that equal IDs identify a true match\n", "addr_zagat_parse['uid'], addr_fodor_parse['uid'] = uids, uids\n", "match_ids = pd.DataFrame({'zagat_id' : addr_zagat_parse['uid'], 'fodor_id' : addr_fodor_parse['uid']})" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
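<table>
<thead>
<tr><th></th><th>addr_zagat</th><th>addr_fodor</th><th>phone_number_zagat</th><th>category_zagat</th><th>phone_number_fodor</th><th>category_fodor</th><th>city_zagat</th><th>city_district_zagat</th><th>house_zagat</th><th>house_number_zagat</th><th>postcode_zagat</th><th>road_zagat</th><th>suburb_zagat</th><th>city_fodor</th><th>city_district_fodor</th><th>house_fodor</th><th>house_number_fodor</th><th>postcode_fodor</th><th>road_fodor</th><th>state_fodor</th><th>suburb_fodor</th><th>zagat_id</th><th>fodor_id</th></tr>
</thead>
<tbody>
<tr><th>0</th><td>arnie mortons of chicago 435 s la cienega blvd los angeles 90048</td><td>arnie mortons of chicago 435 s la cienega blvd los angeles 90048</td><td>310-246-1501</td><td>steakhouses</td><td>310/246-1501</td><td>american</td><td>los angeles</td><td>nan</td><td>arnie mortons of chicago</td><td>435</td><td>90048</td><td>s la cienega blvd</td><td>nan</td><td>los angeles</td><td>nan</td><td>arnie mortons of chicago</td><td>435</td><td>90048</td><td>s la cienega blvd</td><td>nan</td><td>nan</td><td>99bbcd03-ce45-40b5-907f-47f5ae16ae29</td><td>99bbcd03-ce45-40b5-907f-47f5ae16ae29</td></tr>
<tr><th>1</th><td>arts deli 12224 ventura blvd studio city 91604</td><td>arts delicatessen 12224 ventura blvd studio city 91604</td><td>818-762-1221</td><td>delis</td><td>818/762-1221</td><td>american</td><td>studio city</td><td>nan</td><td>arts deli</td><td>12224</td><td>91604</td><td>ventura blvd</td><td>nan</td><td>studio city</td><td>nan</td><td>arts delicatessen</td><td>12224</td><td>91604</td><td>ventura blvd</td><td>nan</td><td>nan</td><td>1b1e1ee1-c880-4722-abaa-4ec44c7d94a6</td><td>1b1e1ee1-c880-4722-abaa-4ec44c7d94a6</td></tr>
<tr><th>2</th><td>belair hotel 701 stone canyon rd bel air 90077</td><td>hotel belair 701 stone canyon rd bel air 90077</td><td>310-472-1211</td><td>californian</td><td>310/472-1211</td><td>californian</td><td>nan</td><td>nan</td><td>belair hotel</td><td>701</td><td>90077</td><td>stone canyon rd bel air</td><td>nan</td><td>nan</td><td>nan</td><td>hotel belair</td><td>701</td><td>90077</td><td>stone canyon rd bel air</td><td>nan</td><td>nan</td><td>f2548f68-2326-4706-bdc1-dbfc265ecbf3</td><td>f2548f68-2326-4706-bdc1-dbfc265ecbf3</td></tr>
<tr><th>3</th><td>cafe bizou 14016 ventura blvd sherman oaks 91423</td><td>cafe bizou 14016 ventura blvd sherman oaks 91423</td><td>818-788-3536</td><td>french bistro</td><td>818/788-3536</td><td>french</td><td>sherman oaks</td><td>nan</td><td>cafe bizou</td><td>14016</td><td>91423</td><td>ventura blvd</td><td>nan</td><td>sherman oaks</td><td>nan</td><td>cafe bizou</td><td>14016</td><td>91423</td><td>ventura blvd</td><td>nan</td><td>nan</td><td>936687e2-1161-4ecd-98c3-b5ac620d8776</td><td>936687e2-1161-4ecd-98c3-b5ac620d8776</td></tr>
<tr><th>4</th><td>campanile 624 s la brea ave los angeles 90036</td><td>campanile 624 s la brea ave los angeles 90036</td><td>213-938-1447</td><td>californian</td><td>213/938-1447</td><td>american</td><td>los angeles</td><td>nan</td><td>campanile</td><td>624</td><td>90036</td><td>s la brea ave</td><td>nan</td><td>los angeles</td><td>nan</td><td>campanile</td><td>624</td><td>90036</td><td>s la brea ave</td><td>nan</td><td>nan</td><td>20a05006-d08e-4245-b014-ab2d0f552662</td><td>20a05006-d08e-4245-b014-ab2d0f552662</td></tr>
</tbody>
</table>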
" ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# join match ids to main dataframe\n", "matched_address = matched_address.join(match_ids)\n", "\n", "# preview of our parsed dataframe with uids assigned\n", "matched_address.head().style.set_table_styles(styles)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Creation of candidate address pairs using a 'full index'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once our addresses have met a particular standard of quality and are segmented into the desired address fields, the next step requires us to create candidate pairs of addresses that potentially resolve to the same address. In record linkage, this step is typically called indexing or blocking, and is required to reduce the number of address pairs that are compared. In doing so we remove pairs that are unlikely to resolve to true matches. To demonstrate the utility of blocking and why it is so important to address matching, we first create a __full index__ which creates all possible combinations of address pairs. More concretely, a full index generates the Cartesian product between both sets of addresses. Conditional on the size of both dataframes, full blocking is highly computationally inefficient, and in our case we create $112\\times 112 = 12544$ candidate links; this has a complexity of $O(n^2)$. We demonstrate the full index method to motivate the desire for practitioners to implement more sophisticated blocking techniques. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Full index " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below, we instantiate an `Index` class before specifying the desired full index method for generating pairs of records. We then create the Cartesian join between the Zagat and Fodor addresses which creates a `MultiIndex` that links every Zagat address with every Fodor address." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING:recordlinkage:indexing - performance warning - A full index can result in large number of record pairs.\n" ] } ], "source": [ "indexer = rl.Index()\n", "indexer.full()\n", "\n", "# create cartesian join between zagat and fodor restaurant addresses\n", "candidate_links = indexer.index(matched_address.city_zagat, matched_address.city_fodor)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "12544 candidate links created using full indexing.\n" ] } ], "source": [ "# this creates a two-level multiindex, so we name addresses from the zagat and fodor databases, respectively.\n", "candidate_links.names = ['zagat', 'fodor']\n", "\n", "print('{} candidate links created using full indexing.'.format(len(candidate_links)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In practice, a full index creates a dataframe with 12,544 rows and thus creates candidate address pairs between every possible combination of address from both the Zagat and Fodor datasets. Once we generate this dataframe of potential matches, we create a match status column and assign a 1 to actual matched addresses and 0 to non-matches based on the unique IDs created earlier." 
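, "

The pairing and labelling described above can be reproduced on a toy example with plain `pandas`. This is only an illustrative sketch: the three-row frames and their IDs are hypothetical stand-ins for the 112 Zagat and Fodor records, and it does not show how `recordlinkage` builds its index internally.

```python
import pandas as pd

# hypothetical stand-ins for the parsed address tables: three rows each,
# where row i of one table matches row i of the other (shared ID)
zagat = pd.DataFrame({'zagat_id': ['a', 'b', 'c']})
fodor = pd.DataFrame({'fodor_id': ['a', 'b', 'c']})

# a full index is the Cartesian product of the two row indexes
full_index = pd.MultiIndex.from_product([zagat.index, fodor.index],
                                        names=['zagat', 'fodor'])

# look up the ID on each side of every candidate pair
pairs = pd.DataFrame({
    'zagat_id': zagat.loc[full_index.get_level_values('zagat'), 'zagat_id'].to_numpy(),
    'fodor_id': fodor.loc[full_index.get_level_values('fodor'), 'fodor_id'].to_numpy(),
})

# 1.0 where both sides carry the same ID, 0.0 otherwise
pairs['match_status'] = (pairs['zagat_id'] == pairs['fodor_id']).astype(float)

n_pairs = len(pairs)                          # 3 x 3 candidate pairs
n_matches = int(pairs['match_status'].sum())  # one true match per record
print(n_pairs, n_matches)
```

At the scale of this notebook the same arithmetic gives 112 x 112 = 12,544 candidate pairs, of which only 112 are true matches, so fewer than 1% of the candidates carry a positive label.
"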
] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# let's create a function we can reuse later on\n", "def return_candidate_links_with_match_status(candidate_links):\n", "\n", "    # extract the row labels of the zagat and fodor records from each level of the multiindex\n", "    zagat_ids = candidate_links.get_level_values('zagat')\n", "    fodor_ids = candidate_links.get_level_values('fodor')\n", "\n", "    # now we create a new dataframe as long as the number of candidate links\n", "    zagat = matched_address.loc[zagat_ids][['city_zagat', 'house_zagat',\n", "                                            'house_number_zagat', 'road_zagat', 'suburb_zagat', 'zagat_id']]\n", "    fodor = matched_address.loc[fodor_ids][['city_fodor', 'house_fodor', 'house_number_fodor',\n", "                                            'road_fodor', 'suburb_fodor', 'fodor_id']]\n", "\n", "    # concatenate addresses from both databases side by side (column-wise)\n", "    candidate_link_df = pd.concat([zagat.reset_index(drop=True), fodor.reset_index(drop=True)], axis=1)\n", "\n", "    # next we create a match status column that we will use to train a machine learning model:\n", "    # 1 for matched (equal IDs), 0 for non-matched\n", "    candidate_link_df['match_status'] = (candidate_link_df['zagat_id'] == candidate_link_df['fodor_id']).astype(float)\n", "\n", "    return candidate_link_df\n", "\n", "candidate_link_df = return_candidate_links_with_match_status(candidate_links)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creation of comparison vectors from indexed addresses" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To resolve addresses into matches and non-matches we generate comparison vectors between each candidate address pair. Each element of this comparison vector is the value of a similarity metric used to assess the closeness of two address fields. 
In our case, we use __Jaro-Winkler similarity__ because it has been observed to perform best on attributes containing named values (e.g., property names, street names, or city names) (Christen, 2012; Yancey, 2005). The Jaro similarity of two given address components $a_1$ and $a_2$ is given by\n", "\n", "$$\n", "jaro\\_sim = \\begin{cases}\n", "    0 & \\text{if } m = 0\\\\\n", "    \\frac{1}{3} \\left(\\frac{m}{|a_1|} + \\frac{m}{|a_2|} + \\frac{m-t}{m}\\right) & \\text{otherwise}\n", "\\end{cases}\n", "$$\n", "\n", "where $|a_i|$ is the length of the address component string $a_i$, $m$ is the number of matching characters, and $t$ is the number of transpositions required to match the two address components (conventionally counted as half the number of matching characters that appear in a different order). The Jaro-Winkler variant then boosts this score for pairs of strings that share a common prefix. We will create a function that makes use of the `jellyfish` implementation of Jaro-Winkler similarity. Several other string similarity metrics are available and are optimised for particular use cases and data types. See Chapter 5 of _Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection_ by Peter Christen for an excellent overview." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "def jarowinkler_similarity(s1, s2):\n", "\n", "    conc = pd.concat([s1, s2], axis=1, ignore_index=True)\n", "\n", "    def jaro_winkler_apply(x):\n", "\n", "        try:\n", "            # named jaro_winkler_similarity in jellyfish >= 0.8 (previously jaro_winkler)\n", "            return jellyfish.jaro_winkler_similarity(x[0], x[1])\n", "        # return NaN if either field is missing, otherwise re-raise the error\n", "        except Exception as err:\n", "            if pd.isnull(x[0]) or pd.isnull(x[1]):\n", "                return np.nan\n", "            else:\n", "                raise err\n", "\n", "    # apply row-wise to concatenated columns\n", "    return conc.apply(jaro_winkler_apply, axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before applying Jaro-Winkler similarity we need to choose columns that were segmented in __both__ the Zagat and Fodor datasets." 
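, "

As an aside on the Jaro formula given earlier in this section, the following is a minimal pure-Python sketch of Jaro and Jaro-Winkler similarity. It is not the `jellyfish` implementation: the function names are hypothetical, and the prefix weight `p = 0.1` with a four-character prefix cap are the conventional textbook choices, stated here as assumptions. Here `t` is taken as half the number of matched characters that appear in a different order.

```python
def jaro_similarity(a1, a2):
    # Jaro similarity of two strings, following the piecewise formula in the text
    len1, len2 = len(a1), len(a2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # characters only count as matching if they are no further apart than this window
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    m = 0
    for i, ch in enumerate(a1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and a2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0

    # t: half the number of matched characters that are out of order
    seq1 = [a1[i] for i in range(len1) if matched1[i]]
    seq2 = [a2[j] for j in range(len2) if matched2[j]]
    t = sum(c1 != c2 for c1, c2 in zip(seq1, seq2)) / 2

    return (m / len1 + m / len2 + (m - t) / m) / 3


def jaro_winkler_similarity(a1, a2, p=0.1, max_prefix=4):
    # Winkler adjustment: reward a shared prefix of up to max_prefix characters
    sim = jaro_similarity(a1, a2)
    prefix = 0
    for c1, c2 in zip(a1[:max_prefix], a2[:max_prefix]):
        if c1 != c2:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)


print(round(jaro_similarity('martha', 'marhta'), 4))
print(round(jaro_winkler_similarity('arts deli', 'arts delicatessen'), 4))
```

Because the prefix bonus is scaled by `1 - sim`, the adjusted score never exceeds 1, and strings that differ only towards the end (such as a shortened versus a full restaurant name) are pulled closer together than under plain Jaro similarity.
"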
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['city_zagat', 'house_zagat', 'house_number_zagat', 'road_zagat',\n", " 'suburb_zagat', 'zagat_id', 'city_fodor', 'house_fodor',\n", " 'house_number_fodor', 'road_fodor', 'suburb_fodor', 'fodor_id',\n", " 'match_status'],\n", " dtype='object')" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# lets take a look at the columns we have available\n", "candidate_link_df.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can only match columns that were parsed in both address datasets, this means we lose two columns, `city_district_zagat` and `state_fodor`, that were parsed by the CRF segmentation model. Once we observe which address fields are common to both datasets, we create so-called comparison vectors from candidate address pairs of the Zagat and Fodor datasets. Each element of the comparison vector represents the string similarity between address fields contained in both databases. For example, `city_jaro` describes the string similarity between the columns `city_zagat` and `city_fodor`. Looking at the first two rows of our comparison vectors dataframe, a `city_jaro` value of 1.00 implies an exact match whereas a value of 0.4040 implies a number of modifications are required to match the two city names, and so these are less likely to correspond to a match." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
<table>\n", 
"<thead><tr><th></th><th>city_jaro</th><th>house_jaro</th><th>house_number_jaro</th><th>road_jaro</th><th>suburb_jaro</th><th>match_status</th></tr></thead>\n", 
"<tbody>\n", 
"<tr><th>0</th><td>1</td><td>1</td><td>1</td><td>1</td><td>0</td><td>1</td></tr>\n", 
"<tr><th>1</th><td>0.40404</td><td>0.568301</td><td>0</td><td>0.629085</td><td>0</td><td>0</td></tr>\n", 
"<tr><th>2</th><td>0</td><td>0.482143</td><td>0</td><td>0.674077</td><td>0</td><td>0</td></tr>\n", 
"<tr><th>3</th><td>0.626263</td><td>0.502778</td><td>0.511111</td><td>0.629085</td><td>0</td><td>0</td></tr>\n", 
"<tr><th>4</th><td>1</td><td>0.45463</td><td>0</td><td>0.831493</td><td>0</td><td>0</td></tr>\n", 
"</tbody>\n", 
"</table>
" ], "text/plain": [ "" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create a function for building comparison vectors we can reuse later\n", "def return_comparison_vectors(candidate_link_df):\n", "\n", "    candidate_link_df['city_jaro'] = jarowinkler_similarity(candidate_link_df.city_zagat, candidate_link_df.city_fodor)\n", "    candidate_link_df['house_jaro'] = jarowinkler_similarity(candidate_link_df.house_zagat, candidate_link_df.house_fodor)\n", "    candidate_link_df['house_number_jaro'] = jarowinkler_similarity(candidate_link_df.house_number_zagat, candidate_link_df.house_number_fodor)\n", "    candidate_link_df['road_jaro'] = jarowinkler_similarity(candidate_link_df.road_zagat, candidate_link_df.road_fodor)\n", "    candidate_link_df['suburb_jaro'] = jarowinkler_similarity(candidate_link_df.suburb_zagat, candidate_link_df.suburb_fodor)\n", "\n", "    # now we build a dataframe that contains the jaro-winkler similarity between the address components and the matching status\n", "    comparison_vectors = candidate_link_df[['city_jaro', 'house_jaro', 'house_number_jaro',\n", "                                            'road_jaro', 'suburb_jaro', 'match_status']]\n", "\n", "    # set NaN values to 0 so the comparison vectors can work with the applied classifiers\n", "    comparison_vectors = comparison_vectors.fillna(0.)\n", "\n", "    return comparison_vectors\n", "\n", "comparison_vectors = return_comparison_vectors(candidate_link_df)\n", "\n", "# let's preview this dataframe to build some intuition as to how it looks\n", "comparison_vectors.head().style.set_table_styles(styles)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Classification and evaluation of match performance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once we obtain comparison vectors for each candidate address pair, we frame our approach as a binary classification problem by resolving the vectors into matches and non-matches. 
As the Zagat and Fodor dataframe has labels that describe whether each address pair is a match, we use supervised classification to train a statistical model, a __random forest__, to classify address pairs with an unknown match status into matches and non-matches. As a reminder, a random forest trains a multitude of decision trees and outputs the mode of the match status predictions of the individual trees. \n", "\n", "In practice, we initialize a random forest object and split our `comparison_vectors` dataframe into a feature matrix of Jaro-Winkler string similarities, $X$, and a vector of match status labels to be predicted, $y$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# create a random forest classifier that uses 100 trees and number of cores equal to those available on machine\n", "rf = RandomForestClassifier(n_estimators = 100,\n", "                            # due to small number of features (5) we do not limit depth of trees\n", "                            max_depth = None,\n", "                            # max number of features to evaluate split is sqrt(n_features)\n", "                            max_features = 'sqrt',\n", "                            n_jobs = os.cpu_count())\n", "\n", "# define metrics we use to assess the model\n", "scoring = ['precision', 'recall', 'f1']\n", "folds = 10\n", "\n", "# extract the jaro-winkler string similarity and match label\n", "X = comparison_vectors.iloc[:, 0:5]\n", "y = comparison_vectors['match_status']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To evaluate the performance of our classification model, we use 10-fold cross-validation, meaning the performance measures are averaged across the test sets used within the 10 folds. We use three metrics that are commonly used to evaluate machine learning models. Recall measures the proportion of true matching address pairs that are correctly classified, or recalled, as matches (Christen, 2012). 
The precision (or, equivalently, the positive predictive value) measures the proportion of address pairs classified as matches that are true matches (Christen, 2012). Finally, the F1 score is the harmonic mean of precision and recall. Our cross-validation exercise is executed in the following cell." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# 10-fold cross-validation procedure\n", "scores = cross_validate(estimator = rf,\n", "                        X = X,\n", "                        y = y,\n", "                        cv = folds,\n", "                        scoring = scoring,\n", "                        return_train_score = False)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean precision score is 0.9546 over 10 folds.\n", "Mean recall score is 0.928 over 10 folds.\n", "Mean F1 score is 0.9383 over 10 folds.\n" ] } ], "source": [ "print('Mean precision score is {} over {} folds.'.format( np.round(np.mean(scores['test_precision']), 4), folds))\n", "print('Mean recall score is {} over {} folds.'.format( np.round(np.mean(scores['test_recall']), 4), folds))\n", "print('Mean F1 score is {} over {} folds.'.format( np.round(np.mean(scores['test_f1']), 4), folds))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Overall, the high precision value implies that around 95% of the address pairs classified as matches are true matches, with the remainder being false positives. Moreover, our recall value implies that around 93% of all true matches were successfully returned, with the remaining 7% incorrectly labelled as non-matches (false negatives). Given the high values of both of these metrics, the accompanying F1 score is similarly high."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Creation of candidate address pairs by blocking on zipcode" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While a Cartesian product could be useful in a linkage exercise where we have a very small number of addresses, in production environments more sophisticated techniques are generally required to create candidate address links. This is particularly the case when we have a large number of addresses. Thus, blocking is typically introduced to partition the set of all possible address comparisons into mutually exclusive blocks. If we let $b$ equal the number of blocks and assume blocks of roughly equal size, we reduce the complexity of the comparison exercise to $O(\\frac{n^2}{b})$, which is far more computationally tractable than the full index method used above.\n", "\n", "When deciding which column to use as a blocking key we generally need to pay attention to two main considerations. Firstly, we pay attention to attribute data quality. Typically when identifying a blocking key, we choose a key that has a __low number of missing values__. This is because choosing a key with many missing values forces a large number of addresses into a block where the key is an empty value, which may lead to many misclassified address matches. Secondly, we pay attention to the __frequency distribution__ of attribute values. We aim for a roughly uniform distribution of values, because in a skewed distribution the values that occur very frequently will dominate the candidate address pairs generated. \n", "\n", "These considerations are addressed in the following two code blocks." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Missing postcodes for Zagat addresses: 1. \n", "Missing postcodes for Fodor addresses: 0.\n" ] } ], "source": [ "print(\"Missing postcodes for Zagat addresses: {}. 
\\nMissing postcodes for Fodor addresses: {}.\"\n", " .format(matched_address.postcode_zagat.isnull().sum(), matched_address.postcode_fodor.isnull().sum()))" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAlcAAAF8CAYAAADiuJ7sAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAIABJREFUeJzs3XtcVGXiBvDnzAzDbRAYrnJRkKvg\nBUHMO5qYXcxum6Ir22Vrt83V2s2trDbd1dzq89t2u2m7bu2ulUUqZbWZlzI1RdFRRERREBBRrnKR\nOwzz+4NlNlMZkBnemTnP9/PxQ8NwZp5zAnk873veIxkMBgOIiIiIyCwUogMQERER2ROWKyIiIiIz\nYrkiIiIiMiOWKyIiIiIzYrkiIiIiMiOWKyIiIiIzUokO0E2n04mOQERERNRrCQkJ1/y81ZQr4Poh\nb5ROpzP7a9oauR8D7r+89x/gMZD7/gM8BnLff8Ayx6Cnk0IcFiQiIiIyI5YrIiIiIjNiuSIiIiIy\nI5YrIiIiIjNiuSIiIiIyI5YrIiIiIjNiuSIiIiIyI5YrIiIiIjNiuSIiIiIyI5YrIiIiIjNiuSIi\nIiIyI5YrEqpd346vznyFszVnRUchIiIyC5YrEiajJAPxf4/HHRvuQMSbEbjvk/uwv2S/6FhERET9\nwnJFA66muQaPffkYJr43ETkVOZg5bCbCteFIP5mOSe9NwoR3J2BT7iboO/WioxIREfWZSnQAkpeP\ncz7Gk18/ifLGcoR4hOC3E36Lkb4jYTAYcKz8GDbmbkRGSQbu33g/5sXOw0f3fQRJkkTHJiIi6jWW\nKxow64+txwOfPQC1Uo1H4h/B3Ji5cFA6AAAkSUKcfxzi/ONQUleCV/a9grQTabg1/FY8GPeg2OBE\nRER9wGFBGhD1rfV4esfTcFQ6Yt2d6/DTkT81FqsfC3YPxgtTX4CrgysWb13Mye5ERGRTWK5oQLy0\n5yWUN5Zj/sj5GOI+xOTX+2v88cRNT6ChrQGpn6aio7NjAFISERH1H8sVWdyZ6jP4y4G/wM/VDymx\nKb3eLnlYMqaHTMf+kv145ftXLJiQiIjIfFiuyOKW7liK9s52PDb2MTiqHHu9nSRJ+M3438DHxQcr\ndq/AodJDFkxJRERkHixXZFHbC7bj87zPMdpvNJKGJvV5ezdHNzwz+Rl0dHZg4acL0djWaIGURERE\n5sNyRRbTrm/Hk18/CYWkwK/H/fqGl1RIGJyA+2Pux+nq01j2zTIzpyQiIjIvliuymLWH1+Jk1UnM\njpiNcG14v17rkfhHEOgWiL/p/oayhjIzJSQiIjI/liuyiMrGSiz/bjk0ag0eHvNwv19PrVRjbuxc\ntOnbsObQGjMkJCIisgyWK7KINzPfRG1LLR4Y/QDcndzN8pq3hN2CQY6DsPbwWjS3N5vlNYmIiMyN\n5YrMzmAw4P3s9+GscsbsyNlme10nlRPujLwTVU1V+CD7A7O9LhERkTmxXJHZ7S/Zj6LaIkwZOgVO\nKiezvvbd0XdDpVDhLwf+AoPBYNbXJiIiMgeWKzK77rNKM4fNNPtre7t4Y3rIdJysOoltBdvM/vpE\nRET9xXJFZtWmb0PaiTRonbUY4z/GIu9xf8z9AIDXMl6zyOsTERH1B8sVmdXWM1tR01KDGaEzoFQo\nLfIeEV4RiPOPw46zO5BT
kWOR9yAiIrpRLFdkVh8ct9yQ4A91n736S8ZfLPo+REREfcVyRWZT21KL\nL/K+QIh7SL8XDTVlfNB4BA0KwgfHP0B5Q7lF34uIiKgvWK7IbDbnbkarvhXJYck3fKub3lJICtw3\n/D606duw9vBai74XERFRX7Bckdl0DwnOCJ0xIO83K2wW3NRuWHt4LTo6OwbkPYmIiExhuSKzOFd3\nDt8VfYdRfqPgr/EfkPd0dnDGzaE3o6KxArsKdw3IexIREZnCckVmseH4BgCWn8j+YzeH3gwA+Djn\n4wF9XyIiouthuaJ+677djYPCAUlDkwb0vUf4joCPiw/ST6WjtaN1QN+biIjoWliuqN+OlR9DbmUu\nxgeNh5uj24C+t0JSYHrIdNS21GJ7wfYBfW8iIqJrYbmifvvo+EcABn5IsFv30OBHOR8JeX8iIqIf\nYrmifvu64GuolWqMCxwn5P0jvSIR4BaALXlb0NjWKCQDERFRN5Yr6pfyhnJkl2djpO9IOKochWSQ\nJAk3h9yMpvYm/OfMf4RkICIi6sZyRf2y8+xOAEBCQILQHLxqkIiIrAXLFfXLjrM7AABjB48VmiPU\nMxShHqH46sxXqGupE5qFiIjkjeWKbpjBYMCOszvg7uiOMG2Y6Di4OfRmtOpb8dmpz0RHISIiGetV\nuVq9ejXmzZuHlJQUZGdnX/HcgQMHMHfuXKSkpGDZsmXo7OxETk4Opk6ditTUVKSmpmLlypUWCU9i\nnaw6iQuXLyB+cDwUkviePj1kOgDg4xMcGiQiInFUpr4gMzMTxcXFSEtLQ35+PpYtW4aNGzcan3/x\nxRexfv16+Pv7Y8mSJdi7dy+cnZ0xa9YsPP/88xYNT2LtKOgaEhQ936pb4KBARHtFY0fBDlQ1VcHb\nxVt0JCIikiGTpxsyMjKQnJwMAAgPD0d9fT0aGhqMz6enp8Pfv+teclqtFjU1NWhs5OXwcrCzsGsy\nu+j5Vj80PXQ69AY9NuduFh2FiIhkymS5qqqqgqenp/Gxl5cXKisrjY81Gg0AoKKiAvv370dSUhKa\nmpqg0+nwyCOP4Kc//SkOHDhggegkUru+Hd8VfYfgQcHw0/iJjmPUPTTIBUWJiEgUk8OCBoPhqseS\nJF3xuerqajz22GN48cUX4enpiejoaCxatAgzZsxAYWEhHnroIWzfvh1qtbrH99LpdDewCz2zxGva\nGkscg6PVR9HQ1oAEzwTk5eWZ/fX7I9wtHHuK92D7/u3wcvSS/feA3Pcf4DGQ+/4DPAZy339gYI+B\nyXLl5+eHqqoq4+OKigp4e/9vLktDQwMeffRRPPHEE5g8eTIAICwsDGFhXVePhYaGwtvbG+Xl5QgO\nDu7xvRISzDt3R6fTmf01bY2ljsGWXVsAAMkxyYgaEmX21++PmR0zkX84H6XOpfDq9JL19wB/BngM\n5L7/AI+B3PcfsMwx6KmsmRwWnDRpErZt2wYAyM3Nha+vr3EoEABefvllPPDAA0hKSjJ+btOmTVi/\nfj0AoLKyEtXV1fDzs56hI+q/HWd3QCkpEecfJzrKVSYETQAAfHnmS8FJiIhIjkyeuYqPj0dsbCxS\nUlIgSRKWL1+O9PR0uLm5YfLkyfjss89QXFyMTZs2AQBmz56NW2+9FUuXLsW2bdvQ1taGFStWmBwS\nJNtR21KLzNJMDPceDo1aY3qDARbsHoygQUHYXrAdS0OWio5DREQyY7JcAcDSpVf+goqOjjb+d05O\nzjW3WbduXT9ikTXbVbgLnYZOjA2wnqsEf2x80Hhsyt2EI5eOYAImiI5DREQyIn7lR7I53be8sZb1\nra5lYtBEAMDe8r2CkxARkdywXFGf7Ti7Ay4OLhjuPVx0lOsa6TcSrg6u2Fu+96orXomIiCyJ5Yr6\npKi2CPmX8hHnHweVolejykKoFCokBibiQvMF5Fbmio5DREQywnJFfdJ9yxtrWpX9erqHBr
88zasG\niYho4LBcUZ98U/gNAOueb9VtXOA4SJDwxekvREchIiIZYbmiPvn+3PfQOmkRPKjnBWGtgbuTO4a5\nDUPG+QxUNVWZ3oCIiMgMWK6o10rqSlB6uRQxPjFX3QLJWo30GIlOQye+zv9adBQiIpIJlivqtYzz\nGQCAWN9YwUl6b5TnKADg0CAREQ0Ylivqtf0l+wEAsT62U64GOw+Gv8YfX+d/jXZ9u+g4REQkAyxX\n1GsZ5zOglJSI9IoUHaXXJEnChKAJqG+tx/fnvhcdh4iIZIDlinqlub0ZRy4eQYRXBBxVjqLj9En3\njZw5NEhERAOB5Yp6RXdRh47ODpsaEuw22n80nFXOXO+KiIgGBMsV9UpGie1NZu+mVqoxNmAszlw6\ng9PVp0XHISIiO8dyRb2y/7ztTWb/oXGB4wAA2/K3CU5CRET2juWKTDIYDMgoyYCPiw98XX1Fx7kh\nYwO6btezrYDlioiILIvlikwqrC1EeWM5YnxiREe5Yf4afwQPCsauol1o7WgVHYeIiOwYyxWZZJxv\nZaNDgt0SAxPR1N5kXK+LiIjIEliuyCRbXJn9WhIDEgFwaJCIiCyL5YpM2l+yH2qlGhHaCNFR+mW0\n32g4KBxYroiIyKJYrqhHDW0NyC7PRqRXJByUDqLj9IuzgzNG+o1EVlkWyhvKRcchIiI7xXJFPTpU\negh6g97m51t1675qcHvBdsFJiIjIXrFcUY+M863spFx1z7vafpblioiILIPlinpkL5PZu4V5hkHr\nrMX2gu3oNHSKjkNERHaI5Yquq3vxUH+NP7TOWtFxzEKSJIwNGIuKxgocKzsmOg4REdkhliu6rjOX\nzqC6udpuhgS7cUkGIiKyJJYruq7uxTbtrVwlDE4AwHJFRESWwXJF12Vcmd1O5lt183T2RIQ2AvvO\n7UNDW4PoOEREZGdYrui6Ms5nwEnlhDDPMNFRzC4xMBHtne34rug70VGIiMjOsFzRNTW1N+FE5QmE\na8OhVChFxzE747yrfA4NEhGRebFc0TUdKzuGTkMnoryiREexiFifWDirnDnvioiIzI7liq5Jd1EH\nAIj0ihScxDIclA4Y4z8GZy6dQWFNoeg4RERkR1iu6JoOXzgMAHZ75goAxgbyVjhERGR+LFd0TbqL\nOjirnBE0KEh0FIsZO7irXO04u0NwEiIisicsV3SVxrZG5Fbm2u1k9m5Bg4Lg5+qHbwu/hb5TLzoO\nERHZCZYrusqxcvuezN5NkiQkBCSgpqXGOMeMiIiov1iu6Cq6C/Y9mf2Huldr31HAoUEiIjIPliu6\nyuGL/53M7m3fZ66ArnIlQeK8KyIiMhuWK7qK7oL9T2bv5u7kjnBtOPaX7OetcIiIyCxYrugKjW2N\nOFl1EhHaCCgkeXx7jA0Yi/bOduwp3iM6ChER2QF5/PakXssqy0KnoROR3vY/36pbQgDnXRERkfmw\nXNEV7H1l9msZ6TsSaqWa866IiMgsWK7oCnJYmf3H1Eo1RvmNwonKE7h4+aLoOEREZONYrugKcliZ\n/Vq6V2vfeXan4CRERGTrelWuVq9ejXnz5iElJQXZ2dlXPHfgwAHMnTsXKSkpWLZsGTo7O01uQ9ap\noa0Bp6pOIcJLPpPZu40N4K1wiIjIPFSmviAzMxPFxcVIS0tDfn4+li1bho0bNxqff/HFF7F+/Xr4\n+/tjyZIl2Lt3L5ydnXvchqxT92R2OQ0JdhvmOQxaJy12nt0Jg8EASZJERyIiIhtl8vRERkYGkpOT\nAQDh4eGor69HQ8P/1gNKT0+Hv78/AECr1aKmpsbkNmSd5LQy+49JkoT4gHhcbLiIE5UnRMchIiIb\nZrJcVVVVwdPT0/jYy8sLlZWVxscajQYAUFFRgf379yMpKcnkNmSduldml2O5AngrHCIiMg+Tw4IG\ng+Gqxz8eMqmursZjjz2GF198EZ6enr3a5lp0OvPfPN
cSr2lrensM9p3dByelExovNiKvLM/CqQZO\nXl7v9sWjzQMAsOnoJkxVT7VkpAHFnwEeA7nvP8BjIPf9Bwb2GJgsV35+fqiqqjI+rqiogLe3t/Fx\nQ0MDHn30UTzxxBOYPHlyr7a5noSEhD6FN0Wn05n9NW1Nb49BQ1sDir4swii/URgePXwAkg2MvLw8\nREX1fg7Z0IKhyKrNwojRI+CocrRgsoHBnwEeA7nvP8BjIPf9ByxzDHoqayaHBSdNmoRt27YBAHJz\nc+Hr62scCgSAl19+GQ888ACSkpJ6vQ1Zn6MXj8IAgywns//Q2ICxaGpvQsb5DNFRiIjIRpk8cxUf\nH4/Y2FikpKRAkiQsX74c6enpcHNzw+TJk/HZZ5+huLgYmzZtAgDMnj0b8+bNu2obsm5yXJn9WhIC\nErD55GbsKNiBaSHTRMchIiIbZLJcAcDSpUuveBwdHW3875ycnF5tQ9ate2V2uZerOL84qBQqbD+7\nHS/NeEl0HCIiskHyWimSrkt3UQdXB1cEDgoUHUUoZwdnjPAdAd0FHaqaqkxvQERE9CMsV4SGtgbk\nVeUhXBsuu5XZr2VswFgYYMA3Z78RHYWIiGwQf5MSssuzYYABEdoI0VGsQvetcLYXbBechIiIbBHL\nFeHoxaMAgHCvcMFJrEOENgLuju7Yfnb7VWu2ERERmcJyRcgqywIAnrn6L4WkQMLgBJyvP49TVadE\nxyEiIhvDckU4WnYUaqUaQ9yHiI5iNTg0SEREN4rlSuba9e04XnEcoR6hUCl6tTKHLBjL1VmWKyIi\n6huWK5k7WXUSbfo2hGs53+qHfFx9MNR9KL4r+g6tHa2i4xARkQ1huZI542R2lqurJAYkoqm9CftL\n9ouOQkRENoTlSuY4mf36OO+KiIhuBMuVzB0tOwoJEoZ5DhMdxeqM8hsFB4UD510REVGfsFzJmMFg\nQFZZFoIGBcHZwVl0HKvTfSucIxePoLKxUnQcIiKyESxXMlZYW4i61joOCfage2hw59mdgpMQEZGt\nYLmSMa7MbhqXZCAior5iuZIxTmY3LVwb3nUrnALeCoeIiHqH5UrGjpZxGQZTFJICCQEJuHD5AnIr\nc0XHISIiG8ByJWNHy47C28UbHk4eoqNYtcSARADAtoJtgpMQEZEtYLmSqYrGCly4fIFDgr3QPe/q\n6/yvBSchIiJbwHIlU1yZvfe8XbwR5hmG3cW70djWKDoOERFZOZYrmeJk9r65KfAmtOnbsKtol+go\nRERk5ViuZKp7MnuEF8tVb4wLHAcA2Hpmq+AkRERk7ViuZOpo2VFo1Br4ufqJjmITYn1j4ergiq35\nW7kkAxER9YjlSoYa2hpwpvoMwrXhkCRJdByboFKokBCQgMLaQpyuPi06DhERWTGWKxk6VnYMBhg4\nmb2PjEOD+RwaJCKi62O5kiFOZr8x4wJYroiIyDSWKxniyuw3xsfVp2tJhqLdaGpvEh2HiIisFMuV\nDB0tOwq1Uo0h7kNER7E5NwXehFZ9K3YVckkGIiK6NpYrmWnXtyOnIgehHqFQKVSi49ic7nlXX535\nSnASIiKyVixXMnOy6iTa9G0cErxBXJKBiIhMYbmSGd72pn+4JAMREZnCciUzxpXZeaXgDeOSDERE\n1BOWK5nJKsuCBAnDPIeJjmKzuCQDERH1hOVKRgwGA7LKshA0KAjODs6i49gsH1cfDPMcxiUZiIjo\nmliuZKSwthB1rXUcEjSD8YHjuSQDERFdE8uVjHSvzB7uxcns/cUlGYiI6HpYrmTEeKWgJ8tVf8X6\nxkKj1uDLM19ySQYiIroCy5WMGK8U9OKwYH+pFCrcFHgTztWdQ3Z5tug4RERkRViuZCSrLAveLt7w\ncPIQHcUuTAyeCADYkrdFcBIiIrImLFcyUdlYidLLpVw81IzGBY6DSqHC53mfi45CRERWhOVKJrh4\nqPlp1BrE+cVBd1
GH8/XnRcchIiIrwXIlE8YrBXnmyqy6hwa/yPtCcBIiIrIWLFcy0X3miuXKvLrL\n1eenOTRIRERdWK5k4ujFo9A4aDBYM1h0FLvip/FDuDYc3xZ+i8utl0XHISIiK8ByJQONbY04XX0a\nYdowSJIkOo7dmRg8EW36Nmwr2CY6ChERWYFelavVq1dj3rx5SElJQXb2lWv6tLa24umnn8a9995r\n/FxOTg6mTp2K1NRUpKamYuXKleZNTX2SXZ4NAwwcErSQScGTAIBXDRIREQBAZeoLMjMzUVxcjLS0\nNOTn52PZsmXYuHGj8flXX30VMTExyM/PN36uqakJs2bNwvPPP2+Z1NQnnG9lWRHaCPi4+OA/Z/6D\njs4OqBQmf6yIiMiOmTxzlZGRgeTkZABAeHg46uvr0dDQYHz+N7/5jfH5bo2NjWaOSf3RfdsbLsNg\nGZIkYULwBFxqvoR95/aJjkNERIKZ/Cd2VVUVYmNjjY+9vLxQWVkJjUYDANBoNKitrb1im6amJuh0\nOjzyyCNobm7G4sWLMX78eJNhdDpdX/MLeU1bs//sfqgkFVrLW5FXmSc6zoDLy7P8Pg+VhgIA1u1d\nB021xuLv1xf8GeAxkPv+AzwGct9/YGCPgcly9eOb0hoMBpOToqOjo7Fo0SLMmDEDhYWFeOihh7B9\n+3ao1eoet0tISOhF5N7T6XRmf01bc/DQQRQ0FCDUMxQxw2NExxlweXl5iIqKsvj7hOpD8Y/8f+BA\nzQHEx8dbzYUD/BngMZD7/gM8BnLff8Ayx6CnsmZyWNDPzw9VVVXGxxUVFfD29u5xm7CwMMyYMQMA\nEBoaCm9vb5SXl/c2L5lRUUMRWvWtvFmzhamVaowLHIeCmgKcrDopOg4REQlkslxNmjQJ27Z1XWKe\nm5sLX19f45Dg9WzatAnr168HAFRWVqK6uhp+fn5miEt9lVffNSQW7snJ7JZmXFCUVw0SEcmayWHB\n+Ph4xMbGIiUlBZIkYfny5UhPT4ebmxtmzpyJJUuWoKysDIWFhUhNTcXcuXMxc+ZMLF26FNu2bUNb\nWxtWrFhhckiQLCOv7r/lyovlytLGB42HUlLis1Of4dnJz4qOQ0REgvTqmvGlS5de8Tg6Otr432+8\n8cY1t1m3bl0/YpG55NXnQYKEMM8w0VHs3iDHQRjtPxoHSw/iXN05DHEfIjoSEREJwBXa7ZjBYMDp\nutMIHBQIFwcX0XFkYVrINADAptxNYoMQEZEwLFd2rLiuGJc7LnPx0AE0ZcgUKCUlPjnxiegoREQk\nCMuVHTty8QgALh46kDycPBDnH4eDpQdRXFssOg4REQnAcmXHdBe61uCI9IoUnERekkKSAHBokIhI\nrliu7NiRMp65EqF7aHBj7kbTX0xERHaH5cpOGQwG6C7o4OXoBXcnd9FxZIVDg0RE8sZyZadKL5ei\nsqkSQ1y5HIAIvGqQiEi+WK7sVPd8K5YrMYxXDebyqkEiIrlhubJTuossVyK5O7ljjP8YZJZmoqi2\nSHQcIiIaQCxXdqp7GQaWK3E4NEhEJE8sV3ZKd1EHX1dfuDm4iY4iW5OHTOaCokREMsRyZYcuXL6A\nsoYyLsEgmLuTO8YMHoNDFw5xaJCISEZYruxQ95AgFw8Vb9rQaQCAjSe45hURkVywXNmh7isFI7x4\n5ko049AgrxokIpINlis71H2lYJRXlOAk5O7kjrEBY3H4wmHkVeWJjkNERAOA5coOHbl4BN4u3tA6\na0VHIQC3hN0CAPj3sX8LTkJERAOB5crOlDeUo/RyKSezW5FJwZOgUWvwfvb70HfqRcchIiILY7my\nM91DgpzMbj0cVY5IGpqE8/Xnsatol+g4RERkYSxXdoZXClqnWeGzAHBokIhIDliu7Ez3mSsOC1qX\nET4jEOgWiPST6bjcell0HCIisiCWKzuju6CD1kkLbxdv0VHoByRJwi1ht6CpvYm3
wyEisnMsV3ak\nsrESJfUliPCKgCRJouPQj/CqQSIieWC5siOcb2Xd/DX+iPOPw+7i3SisKRQdh4iILITlyo7wSkHr\nd8uwrrNX72e/LzgJERFZCsuVHeGZK+uXFJIEJ5UT1h9bD4PBIDoOERFZAMuVHdFd1MHd0R0+Lj6i\no9B1uDi4YMqQKSioKcC+kn2i4xARkQWwXNmJS82XUFRbhEivSE5mt3Kzwv675lUWJ7YTEdkjlis7\n0T0kGOHF9a2sXZx/HHxcfPBJ7idobGsUHYeIiMyM5cpOZJZmAgCivaIFJyFTlAolbou4DfWt9dhw\nfIPoOEREZGYsV3biYOlBAMBwn+GCk1BvzI6YDaWkxJrDazixnYjIzrBc2QGDwYCD5w/Cx8WHK7Pb\nCB9XH0waMglZZVnIOJ8hOg4REZkRy5UdOFd3DuWN5TxrZWPujrobALDm0BrBSYiIyJxYruxA95Bg\ntDfnW9mSOP84DHUfio25G1HRWCE6DhERmQnLlR04eP6/8628eebKlkiShLui7kKbvg3/OPIP0XGI\niMhMWK7swMHSg1BICkR5RYmOQn10S9gtcFY5453D70DfqRcdh4iIzIDlysa169tx5OIRhHiEwNnB\nWXQc6iNXtStmDpuJkvoSfHn6S9FxiIjIDFiubFxORQ6aO5o5JGjD7o7umtj+9qG3BSchIiJzYLmy\ncVzfyvaFeoZilN8o7Di7A6erT4uOQ0RE/cRyZeO6y1WMd4zgJNQfd0XdBQBYe2it4CRERNRfLFc2\n7uD5g3BWOWOI+xDRUagfpgyZAq2zFv/M+ifvN0hEZONYrmxYXUsdTlWdQrR3NJQKpeg41A8OSgfM\niZyDutY6LstARGTjWK5s2KELh2CAgZPZ7cTd0XfDSeWE/8v4P7Tp20THISKiG8RyZcO6Fw/lyuz2\nwd3JHbMjZuN8/XlsOL5BdBwiIrpBvSpXq1evxrx585CSkoLs7OwrnmttbcXTTz+Ne++9t9fbkHnw\nSkH7c3/s/VApVHhl3yvoNHSKjkNERDfAZLnKzMxEcXEx0tLSsGrVKqxcufKK51999VXExMT0aRvq\nP4PBgMzSTPi4+MDbxVt0HDITX1dfJA9LxqmqU/js1Gei4xAR0Q0wWa4yMjKQnJwMAAgPD0d9fT0a\nGhqMz//mN78xPt/bbaj/ztWdQ3ljOc9a2aH5I+ZDgoQ/ff8nGAwG0XGIiKiPVKa+oKqqCrGxscbH\nXl5eqKyshEajAQBoNBrU1tb2aZvr0el0fQrfG5Z4TWuw48IOAIC3wRt5eXk9fq2p5+2dLe5/nDYO\nhy8cxjvb38E473H9ei17/RnoC7kfA7nvP8BjIPf9Bwb2GJgsVz/+l7PBYIAkSWbfBgASEhJMfk1f\n6HQ6s7+mtdhQ1TXheerwqYjyv/4Nm/Py8hAVJd8bOtvq/v/S65d47D+PYXP5Zvxq1q9u+HXs+Weg\nt+R+DOS+/wCPgdz3H7DMMeiprJkcFvTz80NVVZXxcUVFBby9e57jcyPbUN8cLD0IpaREpFek6Chk\nAVHeUUgYnIBvCr/BodJDouMQEVEfmCxXkyZNwrZt2wAAubm58PX1NTm8dyPbUO+169uhu6hDiEcI\nnB2cRcchC1kwcgEA4E/f/0kixSkcAAAgAElEQVRwEiIi6guTw4Lx8fGIjY1FSkoKJEnC8uXLkZ6e\nDjc3N8ycORNLlixBWVkZCgsLkZqairlz5+LOO++8ahsyn+MVx9HS0cLJ7HZujP8YRHtF49NTn+JE\nxQnE+saa3oiIiIQzWa4AYOnSpVc8jo7+36KVb7zxRq+2IfPh4qHyIEkSUken4vlvn8cLu17Ap/M+\nFR2JiIh6gSu026DvS74HAIz0HSk4CVnahKAJGOE7Ap+d+gwZJRmi4xARUS+wXNkYg8GA3UW74eHk\ngeBBwaLjkIVJkoRfxP8CAPDMzme47hURkQ1g
ubIxRbVFKL1cilF+o3q1vAXZvpF+IzExaCL2ntuL\nr858JToOERGZwHJlY/ae2wuAQ4Jy80j8I1BICiz7Zhn0nXrRcYiIqAcsVzZmT/EeAMBov9GCk9BA\nCvUMxS3DbsHxiuP48PiHouMQEVEPWK5szJ7iPXB1cMUwz2Gio9AAezDuQaiVavx+1+/R0tEiOg4R\nEV0Hy5UNKWsow5lLZzDCdwSUCqXoODTA/DR+uDvqbpyrO4e1h9aKjkNERNfBcmVD9hZ3zbca5TdK\ncBISZcHIBdCoNXhp70uoa6kTHYeIiK6B5cqGdM+3YrmSL3cnd8wfMR/VzdX44+4/io5DRETXwHJl\nQ/ac2wO1Uo0oryjRUUig+4bfh0C3QLx+8HVkl2eLjkNERD/CcmUjapprcLz8OGJ8YuCgdBAdhwRy\nVDliyU1LoDfo8av//Aqdhk7RkYiI6AdYrmzEvpJ9MMDAIUECAIwLHIekoUnYX7If/8r6l+g4RET0\nAyxXNoLzrejHFiUugrPKGU/veBrVTdWi4xAR0X+xXNmIPcV7oJSUiPGOER2FrISPqw8ejHsQ1c3V\neGbnM6LjEBHRf7Fc2YDGtkboLuoQ6RUJZwdn0XHIitw3/D6EeYbh3aPvYn/JftFxiIgILFc24cD5\nA+jo7OCQIF1FqVDiyfFPAgB+9Z9foaOzQ3AiIiJiubIBnG9FPRnhOwK3R9yO7PJsvJbxmug4RESy\nx3JlA/ae2wsJEkb6jhQdhazUL+J/Aa2TFr/f9XuufUVEJBjLlZVr07ch43wGQj1D4eboJjoOWSl3\nJ3csnbQUbfo2LExfiNaOVtGRiIhki+XKyh2+cBgtHS0Y5cshQerZhKAJmBM5B8crjuOFb18QHYeI\nSLZYrqyccb6VP8sVmfbY2McQNCgIf874M74r+k50HCIiWWK5snI7z+4EAIz2Gy04CdkCZwdnLJu8\nDApJgQc+ewAN7Q2iIxERyQ7LlRVraGvA3nN7EaGNgNZZKzoO2YgYnxgsHLUQ5+rO4dWcV0XHISKS\nHZYrK/Zd0Xdo07chMTBRdBSyMQtHLUS0dzS+Kv0Kn5z4RHQcIiJZYbmyYl/nfw0AGBcwTnASsjUq\nhQrPTX4OaoUaj3z+CE5VnRIdiYhINliurNjX+V/DxcEFsb6xoqOQDQp2D0bqsFRcbruMe9LuweXW\ny6IjERHJAsuVlcq/lI+CmgIkDE6ASqESHYdsVKJ3Iu6PuR+nqk7hwS0PwmAwiI5ERGT3WK6sVPeQ\nIOdbUX/9MuGXiPOPQ/rJdLy6jxPciYgsjeXKSm0r2AYASAxguaL+USqUeHHqi/Bx8cFz3z6HHQU7\nREciIrJrLFdWqLWjFd8Wfouh7kPhr/EXHYfsgKezJ1ZMWwGlpMT8zfNRVFskOhIRkd1iubJC35/7\nHk3tTRwSJLOK8YnB4psWo7q5Gnd9fBfqW+tFRyIiskssV1aISzCQpcyOmI05UXOQXZ6Ne9PuRZu+\nTXQkIiK7w3Jlhb4u+BqOSkeM9uctb8i8JEnCknFLMDF4Ir4p/AYPb3kYnYZO0bGIiOwKy5WVOV9/\nHjkVOYjzj4NaqRYdh+yQUqHE76f+HjE+Mfjw+Id47pvnREciIrIrLFdWZlt+11WCYwPGCk5C9sxJ\n5YTVN69G0KAgvLLvFbyd+bboSEREdoPlysp8XfDf+VaBnG9FluXu5I5Xkl+Bp5MnFm9djPST6aIj\nERHZBZYrK9LR2YEdBTvgr/FH8KBg0XFIBgLcAvCn5D/BSeWE+ZvnY+uZraIjERHZPJYrK5JZmom6\n1jqMCxgHSZJExyGZiPKKwqqbVwEA7km7xzg0TUREN4blyorwljckSvzgeLx080swwIC70+7GzrM7\nRUciIrJZLFdWZEveFjgoHBA/OF50FJKhsQFjsXL6Sug79Zjz0Rx8W/it6EhERDaJ5cpKnKo6hezy\nbCQGJsLF
wUV0HJKpcYHj8Mfpf0RHZwdmb5iN3UW7RUciIrI5LFdWIi0nDQAwLWSa2CAke+ODxuMP\n0/6A9s523L7hds7BIiLqI5YrK/FJ7idQK9WYFDxJdBQiTAiegD9O6zqDdedHd+LjnI9FRyIishm9\nKlerV6/GvHnzkJKSguzs7Cue279/P37yk59g3rx5ePvtroUIc3JyMHXqVKSmpiI1NRUrV640f3I7\nklORg9zKXNwUeBOHBMlqTAiegFeTX4VaqcaCzQu40CgRUS+pTH1BZmYmiouLkZaWhvz8fCxbtgwb\nN240Pr9q1Sq8++678PPzw4IFCzBr1iw0NTVh1qxZeP755y0a3l5wSJCs1Wj/0fjrrX/F0zuexq+3\n/hpVTVV4MelFLhVCRNQDk2euMjIykJycDAAIDw9HfX09GhoaAAAlJSVwd3fH4MGDoVAokJSUhIyM\nDDQ2Nlo2tR0xGAz4JPcTOKmcMCFogug4RFcJ14bjzdvehL/GHyt2r8CSrUug79SLjkVEZLVMlquq\nqip4enoaH3t5eaGyshIAUFlZCa1Wa3zO29sblZWVaGpqgk6nwyOPPIKf/vSnOHDggAWi24dj5cdw\nuvo0xgeNh7ODs+g4RNcUOCgQb972JkI9QvHWobdw7yf3oqGtQXQsIiKrZHJY0GAwXPW4e0jgx88B\ngCRJiI6OxqJFizBjxgwUFhbioYcewvbt26FWq3t8L51O15fsvWKJ1zSnt06+BQCIdIhEXl6eRd7D\nUq9rK7j/5tv/xeGL8fczf8fneZ8j4e0E/CXxL/Bz9jPb61uKtf89YGly33+Ax0Du+w8M7DEwWa78\n/PxQVVVlfFxRUQFvb+9rPldeXg4fHx+EhYUhLCwMABAaGgpvb2+Ul5cjOLjn++UlJCTc0E5cj06n\nM/trmpPBYMCefXvgrHLGvePuhaPK0ezvkZeXh6ioKLO/rq3g/pt//98c/iZeP/g6vjz9JR45+Ag+\nn/85xgaMNet7mJO1/z1gaXLff4DHQO77D1jmGPRU1kwOC06aNAnbtnWtc5ObmwtfX19oNBoAQFBQ\nEBoaGnD+/Hl0dHRg165dmDRpEjZt2oT169cD6Bo6rK6uhp+f9f/rdqDpLupwtuYsJgZPtEixIrIE\nlUKF347/LR4f+zjKGsow9Z9TkX4yXXQsIiKrYfLMVXx8PGJjY5GSkgJJkrB8+XKkp6fDzc0NM2fO\nxIoVK/DUU08BAG6//XaEhoZCq9Vi6dKl2LZtG9ra2rBixQqTQ4JyxKsEyVZJkoT7Y+9H4KBArNqz\nCvd9ch+WJy3Hi0kvQiFx+TwikjeT5QoAli5desXj6Oho438nJiYiLS3tiufd3d2xbt06M8SzX91X\nCbo6uGJc4DjRcYhuyMTgiXjjtjfw+12/xx92/wGHLxzG+/e8D09nT9MbExHZKf4TU5CDpQdxru4c\nJg2ZBLWSZ/XIdoVrw/HOHe9gbMBY/OfMf5C4LhHHy4+LjkVEJAzLlSAcEiR74u7kjpdnvIwFIxeg\noKYA498dz1vmEJFssVwJ0KZvw0c5H8FN7Yaxg633KiuivlAqlHg0/lH8cdofAQDzN8/H4q8Wo7Wj\nVXAyIqKBxXIlwObczShvLMet4bfCQekgOg6RWU0ZOgVr71iLEI8QvHXoLUx8byLyL+WLjkVENGBY\nrgR469BbkCDhrqi7REchsogh7kOw9o61uD3idhy5eATxf4vnMCERyQbL1QA7cvEI9pfsx7jAcQgc\nFCg6DpHFOKmc8LuJv8NzU55DR2cH5m+ej19+8Us0tzeLjkZEZFEsVwPs7cy3AQD3RN8jOAnRwJg5\nbCb+NvtvCPMMw9+P/B1j141FVlmW6FhERBbDcjWAqpuqsSFnAwLcApAYmCg6DtGACXYPxpo71uCe\n6HuQW5mLcevG4f/2/x86DZ2ioxERmR3L1QB67+h7aOlowd1Rd3MVa5IdtV
KNJTctwcszXoab2g2/\n2/E7zHx/Js7XnxcdjYjIrPgbfoDoO/VYc3gNnFROuDX8VtFxiIS5KegmvHvXu5gYPBHfFn6LUWtH\nGdd9IyKyByxXA+SrM1+hqLYIycOS4eboJjoOkVAeTh5YNX0Vfjvht2juaEbK5hTM2zQPVU1VoqMR\nEfUby9UAeevQWwCAu6PuFpyEyDpIkoQ7I+/EujvXIdYnFp+c+ASxa2Kx5dQW0dGIiPqF5WoAnK4+\nje0F2zHKbxTCtGGi4xBZlaBBQXj91tfxWMJjqGmuwd1pd+OBzx5AbUut6GhERDeE5WoArDm0BgCX\nXyC6HqVCiXkj5uFvs/+GSK9IrD+2HrFrYvF53ueioxER9RnLlYVVNlbivaPvwdvFG5OHTBYdh8iq\nhXqG4u3b38bDcQ+jorECd318F+Zvno/KxkrR0YiIeo3lysJW7VmFy22XMX/EfKgUKtFxiKyeSqFC\n6uhU/H323zHcezg+zvkYMWti8NHxj2AwGETHIyIyieXKgs7WnMXaw2sRoAnAnZF3io5DZFNCPUPx\n5m1v4vGxj+Ny62UsSF+AOR/Pwbm6c6KjERH1iOXKgp7/9nm0d7bj5/E/h4PSQXQcIpujVChxf+z9\neO+u9zDGfwy+PP0lYt6OwWsZr6Gjs0N0PCKia2K5spDDFw7j45yPEekViWkh00THIbJpAW4B+PMt\nf8Yzk56BSqHCU9ufwk3/uAm6CzrR0YiIrsJyZQEGgwHP7HwGAPDLhF/yVjdEZiBJEm4NvxX/vvvf\nmBU2C0cuHsG4f4zDE1ufQF1Lneh4RERG/K1vAdsKtuHbwm8xLnAc4gfHi45DZFfcndzx7ORn8dot\nryHALQBvZL6B6Lej8WH2h5zwTkRWgeXKzPSdejyz8xlIkPCL+F+IjkNkt8YMHoN357yLh+MexqXm\nS1j46UJM+/c05FTkiI5GRDLHcmVmG45vQHZ5NmaGzeRq7EQWplaqkTo6Ff+661+YFDwJe4r3IO6d\nODy17SkOFRKRMCxXZtTQ1oAXdr0AtVKNh+MeFh2HSDYGuw3GqptXYfWM1fDT+OG1A68h8q1IrNOt\ng96gFx2PiGSG5cqMlmxdgnN153B/zP3w0/iJjkMkOxOCJuCfd/0TPx/zc1xuvYxffPkLpO5Nxe6i\n3aKjEZGMsFyZSVpOGv6Z9U9EaCPwwOgHRMchki21Uo2FoxZi/T3rMStsFk7Xn8a0f0/DTz75CQou\nFYiOR0QywHJlBkW1Rfjll7+Es8oZv5/6ey4YSmQFvF288ezkZ/HMiGcQ6xOLzSc3I/rtaCzZuoT3\nKiQii2K56qeOzg4sTF+IutY6LB63GMHuwaIjEdEPhGq6bqOzPGk5fF198Wbmmwh7Iwwv7XkJjW2N\nouMRkR1iueqnVXtWYV/JPkwLmYZbw28VHYeIrkGSJEwLmYZ/3fUvLBm3BApJgRd2vYCINyOw5tAa\ntHa0io5IRHaE5aofvj/3PVbuWQk/Vz88NeEpSJIkOhIR9cBB6YB7ht+DD+/9EKmjUlHTUoNFXy0y\nXlnYrm8XHZGI7ADL1Q0qbyjHT9N/CgB4furz0Kg1ghMRUW+5ql3x8JiH8eG9H+InMT9BWUMZfvHl\nLxD1VhT+lfUv3hSaiPpFJTqALaporMCM9TNwru4cHop7CCN9R4qOREQ3QOusxaLERZgXOw8bjm/A\nl6e/xENbHsIfdv8Bv5v4OzwU9xCcHZxFxyQzade3o6S+BIU1hSiqLUJxXTEqGytR3Vzd9aepGpea\nL6GlowXtne3o6OxAu77ro4PSAS4OLnB1cO36qHaFh5MHBmsGd/1x6/oYNCgIkV6R8HX15WiGjLFc\n9VFVUxWS1yfjROUJ3Df8PqSOShUdiYj6ydvFG0tuWoKUESnYcHwDtuZvxaKvFuEPu/+AJ296Eo8n\nPg53J3fRMamXWjpacLLyJHIqcpBTkY
Pvz3yP0r2lKKkvQaeh87rbuTi4wE3tBhcHFygVSiglpfGj\nvlOPFn0LWjtaUdFYgZa6FjR3NF/3tQY5DkKkVyQitBGI9o5GnH8cxviPQdCgIJYuGWC56oPqpmok\nr0/G8YrjuCf6HixKXMQfEiI74uvqiyfHP4mfjf4ZNp/cjC2ntuC5b5/Dy/texqPxj2JR4iKEeoaK\njkk/0NHZgRMVJ5BZmtn150ImTlScuGplfi9nL8T6xMJf4w9/jT8GawbDX+MPDycPuDu5w03t1udl\ndNr0bbjUfAmXmi+huqnr7Fd5YzlK67uKXHZ5Ng5fOHxVjjj/OMQPjsf4oPGYEDQBg90G9/s4kHVh\nueqlS82XMPP9mThWfgxzouZg8bjFLFZEdkrrrMWj8Y9i/oj5+CLvC2zK3YQ/Z/wZr2W8Zvz5vzn0\nZv4dIMDl1svIOJ+BvcV7sffcXmSWZl5xBslR6Yho72iEeYYh1DMUoR6h0FfrER8bb/YsaqXaWNau\nRd+pR2VTJYpqi5B/KR9nLp1B/qV8fFP4Db4p/Mb4dUPdh2Ji8ERMCJqAqUOnYqTfSCgkTom2ZSxX\nvVDRWIHbP7wdR8uO4o6IO/DETU/wL1UiGdCoNZg/cj7ui7kP3xV9h09PfooteVuwJW8LYnxi8Kux\nv8KCkQugddaKjmq3altqsad4D74r+g67i3cjqyzLOLQnQUKoZyiGew9HtHc0or2jEeoRCqVCecVr\n5NXliYgOpUJpLF/jg8YbP9/Q1oAz1WdwovIEcitzkVuZi49yPsJHOR8BADydPDFl6BRMHTIVSSFJ\nGOM/5qp9IuvGcmXC1/lf44HPHkBFYwVuC78Nv53wW/6Lgkhm1Eo1bgm7BbeE3YLcylykn0zH7uLd\nWLx1MZ7a/hTujr4bD8c9jORhyfwl2E/1rfXYW7wXu4p2YVfRLhy9eBQGGAAADgoHxPjEYJTvKIz0\nG4kRviNs8kptjVqDMYPHYMzgMQAAg8GA0sulOF5+HNnl2ThWfgyf532Oz/M+BwC4O7ojKSQJN4fc\njOmh0zHCdwR/D1k5lqvraO1oxbM7n8VfD/4VDgoHPD72cdwXcx+/oYlkLsYnBjE+MXi8+XHsKNiB\nrflb8cmJT/DJiU8QNCgIC0cuxP2x92OM/xie4e6FpvYm7Du3D98WfotdRbtw+MJh43wplUKFkb4j\nETc4DnF+cYjxiYGjylFwYvOTJAlBg4IQNCgIt0XcBqBrxORY2TFklWch62LWFWXL28Ub00Om4+bQ\nm3Fz6M2I0Ebwe83KsFxdQ25lLhZsXoBj5ccwxH0IXpjyAiK8IkTHIiIronXWYt6IeZgbOxcnq05i\na/5W7CrchZf3vYyX972MMM8w3B9zP4vWjzS0NWB/yX7jMN+h0kNo7+xavFUpKRHtHY0x/mMQNzgO\nsT6xcFI5CU4shq+rL2aGzcTMsJkAgLKGMmSVZeHIxSPIKsvCxtyN2Ji7EQAQ6BaI6aHTMT1kOpKG\nJmGY5zB+vwnGcvUD1U3V+OuBv+LPGX9Gc0cz7oy8E48nPi7bH24iMk2SJOPZrF8n/hoHSw9id9Fu\nZJzPMBatEI8Q3BZ+G24NvxXTQ6bDzdFNdOwBU9FYgX3n9mFfyT58f+576C7qjIu0KiUlIrwiMNpv\nNOIHx2Ok70iuK3Yd/hp/3Bp+K24Nv9U4jNhdtI6WHcUH2R/gg+wPAABBg4KQNDQJSUOTMHXoVBgM\nBsHp5YflCkBlYyVey3gNbx16Cw1tDfB08sSyycswZegU0dGIyIY4qhwxdehUTB06Fa0drcgszcR3\nxd8hszQTaw+vxdrDa+GgcMDkIZNxS9gtmDxkMsYGjLWbf8C1dLTgWNkxHL5wGIcuHMK+kn3Iv5Rv\nfF4pKRHlHYXRfqMR5x+HEb4j4OLgIjCxbfrhMOKcqDkwGAwoqi1CVlkWjpUfw7HyY/jw+If48PiH\nAA
B3B3dMPTMVk4InYWLwRIwNGMsSa2GyLleFNYVYc2gN1hxeg6b2JmidtXh87OO4M+pOu/nLjojE\ncFQ5YsrQKZgydAr0nXrkVuXiUOkhZJZmGidrA12TtBMCEoy/+OL84xDiEWL18zurm6qRU5GD4xXH\njes5Ha84fsWtgzRqDW4KvAkjfEdghO8IRHtH8+9WC5CkrqsmQz1Dcc/we2AwGFBcV4xjZcdwvOI4\nsi5k4YvTX+CL018A6JrLNsJ3BBIDErv+BCYi1ie2z+t80fXJrlzlX8rHptxN2JS7CbqLOgBdkwN/\nPubnuCPiDrucLElEYikVSoz0HYmRviPx8JiHUdNcg2Plx4wriB8qPYQD5w/gzxl/BtBVSrq/fqTf\nSIRrwxHqEYqhHkMHtJw0tTfhbM1ZFFwqQEFNAQouFeDMpTM4XnEcZQ1lV3ytWqk2rkYe5R2FKK8o\nDHEfYvUl0R5JkoQQjxCEeITgrui7kJeXB22QFicqTyCnIgcnq04itzIXWWVZWHdkHYCu9cFifGIw\nym8URvmNMn7v+bn6cf7WDZBNufq28Fs8tf0pZJVlAeg6PZ0YkIhpIdOQPCwZaqVacEIikgtPZ09M\nC5mGaSHTAADN7c04VXUKJ6tO4mzNWZytOYvM0kxknM+4atsAtwCEeoTCT+MHHxcf+Lj4wNvFGz6u\nPtCoNXBWOcPZwRlOKic4q5xx9vJZqMvV0Bv00HfqoTfo0dLRgvrW+iv+VDdV42LDxa4/l7s+Xmq+\ndM38fq5+GB80HqEeXYt0hnqGIsQjBCqFbH6l2BwfVx9Mc/3f91xHZweKaotwquoUTlWdwplLXetu\nHS07esV27o7uiPSKRJR3FCK1kYj0ijT+//Zx8WHxuo5e/SSsXr0ax44dgyRJeO655zBq1Cjjc/v3\n78drr70GpVKJqVOnYtGiRSa3EWHn2Z3IqcjB+KDxSBqahInBEzHIcZDQTEREAODs4HzFukdA102G\nz9WdQ2FtIcoayq4oPQfOH7jq9i492t23PBq1Bl7OXgj1CMVgt8EIcAtAgFsAAt0CEeAWwHlSdkCl\nUCFcG45wbThmR84G0LWifOnlUmPBL6otwrm6c8gqy8KhC4eueg1nlTOGegzFUPehCHALuOIG1oPd\nBsPHxQdeLl7wcPKQ3RlMk+UqMzMTxcXFSEtLQ35+PpYtW4aNGzcan1+1ahXeffdd+Pn5YcGCBZg1\naxYuXbrU4zYivHTzS5gTNQctHS1CcxAR9YaD0gFh2jCEacOuek7fqcfltsuobalFXUsdalu7PrZ0\ndN1YuE3fhlZ9K1r1raitqYXWUwuFQgGFpIBSUsJB4QBXtStcHFygUWuMNyz2cvGC1lnLM/kypVQo\nMcR9CIa4DzGe4QK6vt/KG8tRUleCkvoSlDeWo7yhvKv0X76IU1WnenxdhaSAp5MnvFy84OnkiUGO\ng+Du5I5B6kEY5DgIGrXG+P3o6tD10dnBGY5KRziqHOGkcoKj0hFqpRoOSoeujwoHOCgd4KBwMN5c\nW6VQQano+ii6zJksVxkZGUhOTgYAhIeHo76+Hg0NDdBoNCgpKYG7uzsGD+666WRSUhIyMjJw6dKl\n624jiiRJcFA4oE1qE5ZBBIWkEP5NJhL3X977D9jnMVAoFdA6a3t12528vDxERUUNQCrrZY/fA33R\n3/1XKBXGqxMnYMJVzze1N11x8+rqpmpcar6EutY61LXUoa61DvWt9ahorMDZmrNXXPRgCW5qN+z/\n+X6M8B1h0ffpiclyVVVVhdjYWONjLy8vVFZWQqPRoLKyElrt/364vb29UVJSgpqamutu0xOdTncj\n+9Cn13SFq9nfw5rFe8UDVaJTiMP9l/f+AzwGct9/gMfA0vvvClf4wKerUbj9908P2vRtaOhoQGNH\nIxraG9Ckb0KLvgUt+hY065vRom9Bq74V7Z3taOtsQ5u+DW2dbWjv
bEeHoQMdnR1XfNQb9Og0dKLT\n0IkOQwdcVa64kH8BrSWtV7yvJTrG9ZgsVz9efMxgMBgnsF1rYTJJknrcpicJCQkmv6YvdDqd2V/T\n1sj9GHD/5b3/AI+B3Pcf4DGQ+/4DljkGPZU1k+XKz88PVVX/q7wVFRXw9va+5nPl5eXw8fGBSqW6\n7jZERERE9szkIOykSZOwbds2AEBubi58fX2Nw3tBQUFoaGjA+fPn0dHRgV27dmHSpEk9bkNERERk\nz0yeuYqPj0dsbCxSUlIgSRKWL1+O9PR0uLm5YebMmVixYgWeeuopAMDtt9+O0NBQhIaGXrUNERER\nkRz0ap2rpUuXXvE4Ojra+N+JiYlIS0szuQ0RERGRHMj32lQiIiIiC2C5IiIiIjIjlisiIiIiM2K5\nIiIiIjIjlisiIiIiM2K5IiIiIjIjlisiIiIiM2K5IiIiIjIjlisiIiIiM5IMBoNBdAig57tLExER\nEVmbhISEa37easoVERERkT3gsCARERGRGbFcEREREZkRyxURERGRGbFcEREREZkRyxURERGRGdl1\nuero6MAzzzyDBQsWYO7cuTh8+LDoSANm9erVmDdvHlJSUpCdnS06jhCvvvoq5s2bh/vuuw/bt28X\nHUeIlpYWzJgxA+np6aKjCPH5559jzpw5uPfee7F7927RcQZUY2Mjfv3rXyM1NRUpKSnYu3ev6EgD\n5vTp00hOTsYHH3wAALh48SJSU1OxYMECPPHEE2hraxOc0LKutf8PPvggFi5ciAcffBCVlZWCE1re\nj49Bt7179yIqKsri72/X5WrLli1wdnbGhg0b8NJLL+Hll18WHWlAZGZmori4GGlpaVi1ahVWrlwp\nOtKAO3DgAM6cOYO0tL1yjvQAABeuSURBVDT84x//wOrVq0VHEmLt2rXw8PAQHUOImpoavP3229iw\nYQPeeecd7Ny5U3SkAfXpp58iNDQU77//Pl5//XW89NJLoiMNiKamJqxcuRITJkwwfu6NN97AggUL\nsGHDBgQGBmLTpk0CE1rWtfb/r3/9K+bOnYsPPvgAM2fOxD//+U+BCS3vWscAAFpbW/H3v/8dPj4+\nFs9g1+Vqzpw5WLZsGQBAq9WitrZWcKKBkZGRgeTkZABAeHg46uvr0dDQIDjVwEpMTMTrr78OAHB3\nd0dzczP0er3gVAOroKAA+fn5mDZtmugoQmRkZGDChAnQaDTw9fWV3T8yPD09jX/n1dfXw9PTU3Ci\ngaFWq7Fu3Tr4+voaP3fw4EHMmDEDADBjxgxkZGSIimdx19r/5cuXY9asWQCu/L6wV9c6BgDwzjvv\nYMGCBVCr1RbPYNflysHBAY6OjgCAf//735g9e7bgRAOjqqrqir9Ivby8ZHEa+IeUSiVcXFwAABs3\nbsTUqVOhVCoFpxpYr7zyCp599lnRMYQ5f/48DAYDnnzySSxYsMCuf6Feyx133IELFy5g5syZWLhw\nIZ555hnRkQaESqWCk5PTFZ9rbm42/kL18fGx678Pr7X/Li4uUCqV0Ov12LBhA+68805B6QbGtY5B\nYWEhTp06hdtuu21gMgzIuwyAjRs3YuPGjVd8bvHixZgyZQo+/PBDnDhxAu+8846gdAPrx4vuGwwG\nSJIkKI1YO3fuxKZNm/Dee++JjjKgPvvsM8TFxSE4OFh0FKHKy8vx1ltv4cKFC/jZz36GXbt2yeZn\nYcuWLQgICMC7776LU6dO4fnnn8fmzZtFxxLih//P5XpTkv9v796jmrjyOIB/KSCitEVlxQdgVyUB\nWx8IuiggIqtolZ5WRDmgQOsDllIrxfqoEVBrFd9WCoKl2K4WxdJyfGCx6NGi6Pqo1u0qFKSCoiAK\nKAiEkNz9g5NpAkkmQiQVfp9zcg6ZmXvnd+/MkDtz78xIpVIsW7YMzs7OrbrLuoINGzZAJBJ12Po6\nTePK19cXvr6+raYfOnQIp06d
Qnx8PIyNjfUQWceztLTEw4cPue8PHjyAhYWFHiPSj5ycHOzevRtf\nfvklXn75ZX2H06FOnz6NO3fu4PTp0ygrK0O3bt3Qr18/jB8/Xt+hdZg+ffrAwcEBRkZGsLGxQc+e\nPVFZWYk+ffroO7QO8csvv8DV1RUAYGdnh/LycjQ1NcHIqNP829eaqakpGhoa0L17d5SXl7fqLuoK\nVq5ciUGDBiE8PFzfoXS48vJyFBUVYenSpQCafxPnzp3barC7LnXqbsE7d+7gwIEDiIuL47oHuwIX\nFxdkZWUBAG7cuIG+ffvCzMxMz1F1rJqaGmzatAmJiYldckD3jh07kJ6ejrS0NPj6+iIsLKxLNawA\nwNXVFRcuXIBMJkNlZSXq6uq6zLgjABg0aBB+/fVXAEBpaSl69uzZJRtWADB+/Hjuf+KJEyfg5uam\n54g61uHDh2FsbIzFixfrOxS9sLS0RHZ2NtLS0pCWloa+ffs+14YV0ImuXKly6NAhVFdXY9GiRdy0\n5OTkDhnMpk+jR4/G66+/Dj8/PxgYGCA6OlrfIXW4zMxMVFVVYcmSJdy02NhYDBgwQI9RkY5kaWkJ\nLy8vBAUFob6+HiKRCC+91KnPJ5XMmTMHn3zyCebOnYumpibExMToO6QO8dtvvyE2NhalpaUwMjJC\nVlYWtmzZghUrVuDgwYMYMGAA3n77bX2H+dyoKv+jR49gYmKCefPmAQCGDBnSqfcHVXWwa9euDj3R\nNmBdtQOaEEIIIeQ56DqncYQQQgghHYAaV4QQQgghOkSNK0IIIYQQHaLGFSGEEEKIDlHjihBCCCFE\nh6hxRQghhBCiQ9S4IoQQQgjRIWpcEUIIeS6OHz+O0NBQuLm5wcHBATNnzsTRo0efS7ry8nI4ODhA\nKBTi6dOnuirCc9HWegGAwsJCBAUFYeTIkXB1dcXOnTshlUpbLZednQ1vb2+88cYbmDRpElJSUnRd\nDKKBYUxnfkwr0SmhUIh+/frh9ddf12scRUVFCAwMxMaNGzF27FidPHW9qakJ9vb2GDhwIOzt7Z95\n/oto+PDh6NevX6cpz7PSxTblOyZ0ecy0JS+pVIrg4GDcunVLL68/EolEsLCwQGBgIGbNmgWpVIrY\n2FiYm5tj5MiROk0XHR2N8vJy1NXVISQk5C/9Jo621svjx48xe/Zs9OnTB1FRURAKhYiLi4NYLFZ6\nGfOVK1ewYMECeHh4ICIiAr1798bOnTvRs2dPjBo1Cps3b0ZiYiLeeuutLvXWgo7UqV9/0xnNmzcP\nly5dwv79++Ho6Kg0b8WKFQCAjRs36iO0DnPw4EHU1NTgwoULMDU11Xc4L6z//ve/+g5Bo8uXL0Mi\nkSj9aJBns2PHDtTV1Sm9BqojJSQkoHfv3tz3cePG4cGDB0hJSeFexaKLdJcvX0ZOTg5CQkKwadMm\n3RbiOWhrvRw4cABisRhxcXEwMzODi4sLamtrERcXh4ULF3LvkI2Pj4ejoyPWr18PoPk9m0+ePEF8\nfDz8/f0REREBPz8/7Nq1S2/7RmdHTdYXUK9evRAVFYXGxkZ9h6IXT548Qf/+/dGjRw8YGBjoOxzy\nnHz99de4cOGCvsN4Yd27dw8pKSn46KOPuBc2V1ZWQigUIjc3V2nZ9evXY/bs2TqPQbEBIWdvb4/K\nykqdpZNKpVi3bh3CwsLa9WLuiooKLF++HOPHj4ednR2EQiH3mTlzZpvzVaWt9fLzzz/D1dWVa0QB\nwPTp09HQ0ICLFy9y027evNnqpMTFxQWPHz/GtWvXYGRkhIiICCQnJ+P+/fvtLA1RhRpXLyBfX18A\nQFJSktplhEIhDh06xH1vamqCUCjE999/r7RMRkYG5s+fj1GjRmHq1Km4fv06UlNTMXHiRDg6OmLF\nihVK/fnV1dUICwuDg4MDJk2ahK+++kppvY8fP0Z0dDTc3d0xcuRIvPPOOzhz5ozSOvfu3QsvLy
8E\nBwerjL2qqgorV67ExIkT4eTkhNmzZ+Ps2bMAgIULFyIjIwNXr17F8OHDcenSJZV5FBUVYdGiRXB2\ndoajoyMCAgLwv//9j5tfUFAAPz8/ODg4YOrUqcjJyVFKzzdfXVn4yn/s2DF4e3vDwcEBY8eORXh4\nOMrLy7Wa3968VcUv30eEQiGOHDmCxYsXw9HREa6urti9e7fatPI0+/fvx6JFizBq1Cg4OzsjOTmZ\nm69pO/LF7OfnhxMnTmDPnj1wcnICADx8+BCRkZEYM2YMnJ2dERkZqfRjxLc+vm3KV7/q8B0Tivhi\n5CujopiYGHh5eeHRo0cq5//73/+GjY2NUndgfn4+AMDOzk5p2fz8fAgEApX5MMbQ1NTE+9HW1atX\nMWTIEK2X50snv5oTEBDwzHnKicVivPvuu7h06RI+/vhj7N69m9vv5syZg/nz5ystr+s60VQ+RUVF\nRRg8eLDStAEDBsDU1BRFRUVK5WnZLSr/fuvWLQDNjS0bGxt88803zxQn0RIjL5S5c+eyzz//nF25\ncoUNHz6cFRYWcvOWL1/Oli9fzhhjTCAQsLS0NG6eRCJhAoGApaenc9MEAgGbMWMGu3nzJhOLxWzB\nggXM3d2dbdy4kdXX17OCggL2xhtvsJMnT3LLu7m5sdzcXNbY2MgyMzOZQCBgp06d4vL09/dnISEh\nrKKigonFYrZv3z42bNgwVlJSwuUxffp0VlBQwGQymdoyzp07l927d4/Lw87Ojv3+++9cOf38/DTW\n04wZM9jSpUtZfX09q6+vZ8uWLWMeHh6MMcZkMhmbMmUKCw8PZzU1NayiooKFhoZy9cM3X7H+WpZF\nU/nLysqYvb09O336NJPJZKyyspK9//777KOPPmKMMd757clbFcV9RCAQsClTprBLly6xpqYmdvDg\nQSYQCFh+fr7G9O7u7uzixYussbGRHTt2jAkEAnbu3DmttiNfzB4eHmzbtm3c+ubMmcPef/99VlVV\nxaqrq1lwcDALCgrSar/RZpvy7bvq6kDTMdHyOOSrE01lVMzriy++YO7u7qy0tFRtbDNmzGCffvqp\n0rTk5GTm4uLSatmxY8eyb775RmU+6enpTCAQ8H60kZuby4RCodJx1J50lZWVbMyYMez06dNKsdbW\n1j5T/tu2bWOjR49mZWVl3LTi4mImEAjYDz/80Gp5XdaJpvK1NGzYMJaSktJqupubG9u6dSv3/Z13\n3mHh4eFKyyQmJjKBQMASEhK4aZ999hmbPn261nES7dGYqxfU6NGjMXPmTIhEInz77bdt7h7z8PDg\nzmInTpyI8+fPY8mSJTAxMcHQoUMhFApRWFiISZMmAQAmTZrEXW6eNm0aEhMTkZ2dDQ8PD+Tl5eHy\n5cv48ccfYWFhAQAICAhAeno60tPTub59Nzc3DB06VGU8v//+Oy5evIjvvvsO/fv35/I4cOAA0tLS\nsGrVKq3KlZqaCiMjI3Tv3h0A8OabbyIjIwMVFRW4f/8+bt++jZ07d8LMzAxmZmYICwvDqVOnADSP\nRdI0X5FiWfjK7+3tDalUClNTUxgYGKBXr17YtWsXt+1qa2vVzm9v3trw9PTkzta9vb2xevVqjVc0\n5GnGjBnD1XFSUhKysrJgYWHBux01lbelvLw8XL16FYcPH4a5uTkAYM2aNbh58yYYYygoKNC4Pm9v\nb43bVNt9VxVNx4Qivn3bx8dHYxnl0tPTsX//fuzbt0/tzRxSqZS7q6xlPba8alVWVobq6moIhUKV\neXl4eOC7775TW35t3b17F5GRkfD09HymbjZN6bZv344RI0bA3d29XbEdOXIEs2fPhqWlJTfN2toa\nL730Empqalotr6s6AZ69XlQdH4wxpel+fn6IiYlBWloavLy8cP36de5uQcUB7HZ2dti7dy8kEgmM\njY11UBoiR42rF9jSpUsxbdo0pKamwt/fv015DBw4kPvb1N
QUFhYWMDExUZomFou577a2tkrprays\nUFZWBgDcZem33npLaRnGmFJjysrKSm08JSUlKtczZMgQ3L17V6syAc2X2L/44gsUFhZCLBZzP05i\nsZgbY6AYh+L6+OYrUlyGr/xDhgxBYGAggoODIRAIMG7cOEydOpW7O0jT/PbmrY1BgwZxf8tvFGho\naNCYpmUXhbW1NcrKyrTajs8S8+3btwEo17eNjQ1sbGwA8O83fNtU231XFU3HhCK+GPnKCAA5OTk4\nefIkYmJi8Pe//11tTNXV1ZDJZK3GIOXl5WHChAmtpgFQ27gyNzfHyy+/rHZd2qiursbChQvRv39/\nbN68WSfpCgoK8P3332Pfvn148uQJAKC+vh5A84mKoaEhd3Klya1bt1BaWtpqjFJlZSVkMhn+9re/\ntUqjizoBnr1eXnnlFZWNvdraWqV4fHx8kJeXh5iYGKxevRqmpqZYunQp1q1bx508AH+O/aqqqkLf\nvn3bXR7yJ2pcvcDMzMwQFRWFFStWwNPTU+OyMplM5fSWt+Hy3Zbb8nkqjDGuL1/eKDt79ixeffVV\ntXloc4u04pk60By/RCLhTQcAf/zxB/71r39h3rx52L17N8zNzZGTk4MFCxYAAHcjgOKZnmL98M1X\nVxZtyr9q1SosWLAAZ8+exc8//4yAgADMnz8fERERGuePGDGi3Xnzacst2S3rpeUZNN921DZmQ0ND\nlfm1pG59fNtU231XFU3HxLPEqE0Z//Of/2Dy5Mn4/PPP4enpqXJgtCLF8jY2NqKoqKjV+KFffvkF\nlpaWasv9ww8/YOXKlRrXA/w5nqul+vp6hIaGQiKRICkpCT169ODNS5t0xcXFkEgkmDNnTqu0EyZM\nwKxZs7i75TSRj0vs06eP0vScnBwYGxvDxcWlVZr21gnQtnoZPHiw0tgqoPlksK6uTulEx9DQEFFR\nUfjwww9RVlYGKysrLp3iyQvf8UTajhpXL7jJkycjIyMDa9euVTpzMTExUbrqUFxcrJP1yc++5e7e\nvcs9EuK1114DANy4cUPpLPDOnTuwsrLSqotKfvUkPz8fo0aN4qYXFhZq/ZyeGzduQCKRICQkhOte\n+fXXX7n58i6Ze/fucVcRFP8J8s1Xh6/8jDE8efIElpaW8PHxgY+PDw4dOoQNGzYgIiICMplM7Xz5\nwPO25v28tNyvSkpK4OTkpNV21FTeljHL67aoqIhraJaUlCA7OxuBgYG86+Pbpu3ZdzUdE4r4YuQr\nIwBERERg1qxZCAgIwPLly5GUlKQyNnNzcxgYGCgNhr916xYkEolSI/rp06c4cuSI2qtWQPu6wJqa\nmvDhhx/i9u3bSE1NbdWAaU+60aNHtxqMnZOTgz179iApKQnW1tZarUv+f/OPP/7gniEmFouRkJCA\nadOmqbxC1d5uwbbWy4QJE5CcnIza2lrujsHMzEx0794dY8eObbX8q6++yjWav/32Wzg4OCgNmq+q\nqgKAdt1lSVSjuwU7gaioKFy4cEHp9urBgwcjOzsbdXV1qKysRHx8vE761LOzs7nnDx09ehR5eXl4\n8803ATR3b7i6uiI2NhbFxcWQSqX46aefMH36dFy5ckWr/G1tbTFu3Dhs2rQJDx48gFgsRkpKCoqL\nizFr1iyt8pD/U71y5QrEYjGOHz/O3VV4//59jBgxAhYWFkhISEBtbS3Ky8uRmJjI/UjxzVeHr/xH\njx7FjBkzcP36dTDG8PTpU/z222/cGaem+e3N+3lR3B+OHTuG/Px8TJs2TavtyBezqakpSkpKUFNT\ng8GDB2PMmDHYsWMHHj58iJqaGmzYsAFnzpyBkZER7/r4tml79l1Nx4QivhhtbW01lhFoviJhZGSE\nrVu34urVq/jyyy9VxmRoaAhbW1ulBmReXh4MDQ2RkJCAzMxMHD58GMHBwaioqEBDQwPXPdhSr169\nMHz4cN6PKmvWrMGZM2
cQFhbGPQZA/lF8lExGRgaGDRuG0tJSrdP17t0b//jHP5Q+8n3HycmJ+3vr\n1q2IjIxUu/3s7e1hbW2NLVu24Pjx4/jxxx/h7++PxsZGiEQindeJtuVrWSdA81iqbt264YMPPkBu\nbi4OHjyIuLg4BAcHKz2e4dq1a0hOTkZubi5OnDiBxYsXIysrCy2fGZ6Xl4ehQ4dyvw18dUW0R1eu\nOgFLS0tERkZizZo13DSRSITo6Gg4OzvD2toaIpEI58+fb/e65s+fj6SkJFy8eBG9e/fGqlWruEHQ\nALB582Z89tln8PX1hUQiwaBBgxAbG6u0DJ/Nmzdj/fr18PHxQUNDA2xtbfH111+3GoirzogRIxAa\nGopPPvkEMpkM//znPxEXF4eQkBAsXLgQSUlJSEpKQnR0NFxdXWFpaYmVK1dyz1Tq1q2bxvl8sasr\nP2MMpaWlWLJkCR4+fIgePXrA0dER27ZtA9A8iFzT/Pbk/bz4+flx+4OpqSlEIhE3wJ1vO/KV19/f\nH1u2bIGnpycyMzMRFxeH6OhoTJkyBcbGxhg/fjzWrVunVPea1se3Tdu67/IdE4r4YuQro5yVlRXW\nrl2Ljz/+GE5OTnBwcGi1jKurq9LjJm7evAlbW1tMnjwZq1atgpmZGcLDw3Ht2jWcOnUKFRUVWh9j\n2jp37hwAqOyeO3nyJDe+TCaTQSqVct1U2qbTRkVFhcoxcHJGRkZISEhAdHQ0li1bhp49e3JPNn/W\nLmJtaVO+lnUCNF+J2rt3L9auXYvQ0FC88sorCAoKwgcffNCqTPJjxsDAAE5OTkhNTW11hfLcuXNK\nY/AqKipw7949XRa1yzJg1OlKCGkDoVCITz/9lHvuGvlrKS0thZeXF/bs2YNx48YhMDAQAwYM6PRv\ncGjp0aNHWL16NeLj4/Udyl/K+fPnsWjRImRlZenkFWJEGXULEkJIJzRw4EAEBQVh+/btkEqlyM/P\n1/mVqRdBZmamykHpXVlTUxO2b9+Od999lxpWzwl1CxJCSCcVERGB9957DyKRCNXV1V3yJd2a3tXX\nVe3YsQMmJiZYvHixvkPptKhbkBBCCCFEh6hbkBBCCCFEh6hxRQghhBCiQ9S4IoQQQgjRIWpcEUII\nIYToEDWuCCGEEEJ0iBpXhBBCCCE6RI0rQgghhBAdosYVIYQQQogO/R9vC3Qpvci8YwAAAABJRU5E\nrkJggg==\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# check distribution of postcode blocks\n", "pc_dist = matched_address.groupby('postcode_fodor').size().to_frame().rename(columns={0:'n_addresses'})\n", "\n", "f, ax = plt.subplots(1, figsize=(10,6))\n", "sns.kdeplot(pc_dist.n_addresses.values, color='g', shade=True, legend=False)\n", "ax.set_xlabel('Number of addresses in postcode block ($\\mu = {}$, $\\sigma = {}$).'\n", " .format(np.mean(pc_dist.n_addresses), np.round(np.std(pc_dist.n_addresses), 2)), size=15)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The postcode attribute looks like a sensible choice of blocking key because it contains just one missing value and there are very low numbers of candidate address comparisons within each block. As you can see in the output below, when we use a more sophisticated indexing technique we generate a far lower number of candidate address comparisons. In fact, we create only 1014 candidate address links despite adding synthetic non-matches (discussed below). Overall, our introduction of blocking substantially lowers the computational requirement of the linkage task." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creation of synthetic non-matched addresses" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make this exercise more realistic, let's also create 112 synthetic non-matches so we have 224 addresses in total. This will also be important for training our machine learning technique to learn the representations of non-matched addresses in addition to matches. In this case we use the FEBRL data set generator script `generate.py` to create an artificially generated dataset (see http://users.cecs.anu.edu.au/~Peter.Christen/Febrl/febrl-0.3/febrldoc-0.3/node70.html). The script uses Python 2.7, so we read the output as JSON so the user does not have to rely on an external input. 
We do this in keeping with a self-contained notebook but describe the steps required to reproduce the non-matches below. \n", "\n", "The synthetic non-matches are essentially random permutations of the matched addresses. These are constructed on the basis of frequency tables for each address field that count the occurrence of particular values. For example, the first row of a frequency table for a house number would look like: $<$house_number_attribute_value$>$,$<$frequency_of_occurrence$>$." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "# first we need columns from the zagat and fodor databases to create random addresses\n", "zagat_cols = ['city_zagat', 'house_number_zagat', 'house_zagat', 'suburb_zagat', 'road_zagat', 'postcode_zagat']\n", "fodor_cols = ['city_fodor', 'house_number_fodor', 'house_fodor', 'suburb_fodor', 'road_fodor','postcode_fodor']\n", "\n", "# create a directory for address component frequencies\n", "if not os.path.exists('freqs'):\n", "    os.makedirs('freqs')\n", "\n", "# create distributions of address components for both datasets that will be used to create fake addresses\n", "for cols in [zagat_cols, fodor_cols]:\n", "    for col in cols:\n", "        freq = matched_address[col].value_counts().reset_index()\n", "        freq.to_csv('freqs/{}_freq.csv'.format(col), index=False, header=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After the output file name, `generate.py` takes six parameters that are used to create non-matched addresses. The first of these specifies the number of original records to be generated; the second specifies the number of duplicate records from the original to be generated; and the third, fourth and fifth arguments define the maximal number of duplicate records that will be created based on one original record, the maximum number of modifications introduced to the address field, and the maximum number of modifications introduced to the address, respectively. 
The final parameter specifies which probability distribution will be used to create duplicate records - i.e. uniform, poisson, or zipf. In our case we are only interested in building synthetic non-matches (and not duplicates), so we set the number of original records to be built as 112, the number of duplicates generated as 0, and leave the number of modifications at the recommended default settings.\n", "\n", "In addition, for each address field, users are asked to define a dictionary inside `generate.py` that outlines the probability for particular modifications. This includes setting the probability of modifications such as misspellings, insertions, deletions, substitutions and transpositions of words and characters. An example dictionary for the house number address field is given below where we set the file path to the word frequency CSV generated above:" ] }, { "cell_type": "code", "execution_count": 114, "metadata": {}, "outputs": [], "source": [ "house_number_dict = {'name':'house_number',\n", " 'type':'freq',\n", " 'char_range':'digit',\n", " # 'freq_file':'freqs/house_number_fodor_freq.csv',\n", " 'freq_file':'freqs/house_number_zagat_freq.csv',\n", " 'select_prob':0.20,\n", " 'ins_prob':0.10,\n", " 'del_prob':0.16,\n", " 'sub_prob':0.54,\n", " 'trans_prob':0.00,\n", " 'val_swap_prob':0.00,\n", " 'wrd_swap_prob':0.00,\n", " 'spc_ins_prob':0.00,\n", " 'spc_del_prob':0.00,\n", " 'miss_prob':0.00,\n", " 'new_val_prob':0.20}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Damerau (1964) finds that typographical errors are typically distributed as substitutions (59%), deletions (16%), transpositions (2%), insertions (10%) and multiple errors (13%). For this reason we broadly align our dictionary probabilities with these findings. 
After defining sensible probabilities for modifications, we run the following commands in a terminal, which create the files `zagat_synthetic_addresses.csv` and `fodor_synthetic_addresses.csv`, consisting of synthetic addresses from the Zagat and Fodor datasets, respectively. \n", "\n", "For simplicity we generate our non-matches using all the data at once. However, in a real-world application, we might wish to create non-matches within each zipcode block one at a time. This would create more realistic synthetic non-matches because non-matched addresses would be constructed from the frequency tables of each zipcode block, meaning each non-match would have more in common with actual matched addresses. In practice, this would improve the ability of our classification model to distinguish between candidate address pairs that have very subtle differences yet are still matched or non-matched." ] }, { "cell_type": "code", "execution_count": 112, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Create 112 original and 0 duplicate records\n", " Distribution of number of duplicates (maximal 4 duplicates):\n", " [(1, 0.0), (2, 0.375), (3, 0.75), (4, 0.9375)]\n", "\n", "Step 1: Load and process frequency tables and misspellings dictionaries\n", "\n", "Step 2: Create original records\n", "\n", "\n", "Step 2: Create duplicate records\n", "\n", "\n", "Step 3: Write output file\n", "End.\n" ] } ], "source": [ "# ! 
python2 generate.py zagat_synthetic_addresses.csv 112 0 4 2 2 poisson" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Create 112 original and 0 duplicate records\n", " Distribution of number of duplicates (maximal 4 duplicates):\n", " [(1, 0.0), (2, 0.375), (3, 0.75), (4, 0.9375)]\n", "\n", "Step 1: Load and process frequency tables and misspellings dictionaries\n", "\n", "Step 2: Create original records\n", "\n", "\n", "Step 2: Create duplicate records\n", "\n", "\n", "Step 3: Write output file\n", "End.\n" ] } ], "source": [ "# ! python2 generate.py fodor_synthetic_addresses.csv 112 0 4 2 2 poisson" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We then read these synthetic non-matches into a dataframe." ] }, { "cell_type": "code", "execution_count": 116, "metadata": {}, "outputs": [], "source": [ "# read parsed synthetic addresses\n", "synthetic_zagat_address = pd.read_csv('zagat_synthetic_addresses.csv').add_suffix('_zagat').drop(columns=['rec_id_zagat'])\n", "synthetic_fodor_address = pd.read_csv('fodor_synthetic_addresses.csv').add_suffix('_fodor').drop(columns=['rec_id_fodor'])\n", "\n", "# set uids for synthetic addresses (one fresh UUID per row)\n", "synthetic_zagat_address['zagat_id'] = [str(uuid.uuid4()) for _ in range(len(synthetic_zagat_address))]\n", "synthetic_fodor_address['fodor_id'] = [str(uuid.uuid4()) for _ in range(len(synthetic_fodor_address))]\n", "\n", "# join synthetic zagat and fodor addresses horizontally (side by side, aligned on the row index)\n", "synthetic_non_matches = synthetic_zagat_address.join(synthetic_fodor_address)\n", "\n", "# remove whitespace from column names and attributes\n", "synthetic_non_matches = synthetic_non_matches.rename(columns = lambda x : x.strip())\n", "synthetic_non_matches = synthetic_non_matches.applymap(lambda x : x.strip() if isinstance(x, str) else x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have generated synthetic non-matches, we need to 
join these back to our dataframe of matched addresses. As the above steps require external scripts we provide the JSON required to reconstruct the synthetic dataframe in the dedicated GitHub repository. This can be read by executing the cells below, which download the file and load it with the `pd.read_json` function." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2019-12-21 09:11:11-- https://raw.githubusercontent.com/SamComber/address_matching_workflow/master/synthetic_addresses.json\n", "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 199.232.56.133\n", "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|199.232.56.133|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 29098 (28K) [text/plain]\n", "Saving to: ‘synthetic_addresses.json’\n", "\n", "synthetic_addresses 100%[===================>] 28.42K --.-KB/s in 0.02s \n", "\n", "2019-12-21 09:11:11 (1.16 MB/s) - ‘synthetic_addresses.json’ saved [29098/29098]\n", "\n" ] } ], "source": [ "! wget https://raw.githubusercontent.com/SamComber/address_matching_workflow/master/synthetic_addresses.json" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "f = 'synthetic_addresses.json'\n", "\n", "synthetic_non_matches = pd.read_json(f)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the cell below we join our matched addresses with our synthetic non-matches, creating a dataframe of 224 address pairs." 
] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "224 address pairs created consisting of 112 matches and 112 synthetic non-matches.\n" ] } ], "source": [ "# align columns of matched_address dataframe with synthetic_non_matches before concatenation\n", "matched_address = matched_address[['house_zagat', 'house_number_zagat', 'road_zagat', 'suburb_zagat', 'city_zagat', 'postcode_zagat','zagat_id', 'house_fodor', 'house_number_fodor', 'road_fodor', 'suburb_fodor', 'city_fodor', 'postcode_fodor', 'fodor_id']]\n", "\n", "# vertical (row-wise) concatenation of matched addresses and synthetic non-matches\n", "matches_with_non_matches = pd.concat([matched_address, synthetic_non_matches], ignore_index=True)\n", "\n", "print('{} address pairs created consisting of {} matches and {} synthetic non-matches.'.format(matches_with_non_matches.shape[0],\n", " matched_address.shape[0],\n", " synthetic_non_matches.shape[0]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Blocking on postcode attribute" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With our matches and synthetic non-matches assembled into a dataframe with 224 address pairs, we can proceed to block on postcode values to create mutually exclusive address partitions. Thus, for every unique postcode value, a dataframe (or block) will be created in which candidate address pairs will be classified as matches or non-matches based on attributes of their comparison vectors. \n", "\n", "The following code block creates a `MultiIndex` that links together the IDs of addresses that are within the same zipcode block." 
] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1014 candidate links created using the postcode attribute as a blocking key.\n" ] } ], "source": [ "indexer = rl.Index()\n", "\n", "# block on postcode attribute\n", "indexer.block(left_on='postcode_zagat', right_on='postcode_fodor')\n", "candidate_links = indexer.index(matches_with_non_matches, matches_with_non_matches)\n", "\n", "# this creates a two-level multiindex, so we name addresses from the zagat and fodor databases, respectively.\n", "candidate_links.names = ['zagat', 'fodor']\n", "\n", "print('{} candidate links created using the postcode attribute as a blocking key.'.format(len(candidate_links)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We follow the same work flow as before and create comparison vectors for each of the 1014 candidate address links." ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "candidate_link_df = return_candidate_links_with_match_status(candidate_links)\n", "\n", "comparison_vectors = return_comparison_vectors(candidate_link_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Following this, we train our random forest on the comparison vectors and match status labels. We use a 75/25 split for our train and test data." 
] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "X = comparison_vectors.iloc[:, 0:5]\n", "y = comparison_vectors.match_status\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)\n", "\n", "# create a random forest classifier that uses 100 trees and number of cores equal to those available on machine\n", "rf = RandomForestClassifier(n_estimators = 100, \n", " # Due to small number of features (5) we do not limit depth of trees\n", " max_depth = None, \n", " # max number of features to evaluate split is sqrt(n_features)\n", " # ('sqrt' replaces the 'auto' alias, which recent scikit-learn releases no longer accept)\n", " max_features = 'sqrt', \n", " n_jobs = os.cpu_count())\n", "\n", "# predict match status of unseen address pairs\n", "y_pred = rf.fit(X_train, y_train).predict(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Classification and evaluation of match performance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Having fit our random forest on the training data we can now assess the model using the metrics we introduced earlier. We can also produce a confusion matrix which shows true negatives in the top-left quadrant, false positives in the top-right, false negatives in the bottom-left and true positives in the bottom-right. At first glance, the findings from the evaluation metrics below may seem counter-intuitive, especially as the classification exercise using the full index performed better. However, it is pertinent to remind ourselves that in that exercise we trained our classification model on __matched addresses only__, which reflected an idealised but unrealistic scenario. In the results below we introduce synthetic non-matches, which reflect a scenario that a user is more likely to encounter in a real-world address matching exercise. \n", "\n", "In the following code block we generate evaluation metrics and a confusion matrix for evaluating match performance." 
] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Precision score: 0.8519.\n", "Recall score: 0.8846.\n", "F1 score: 0.8679.\n" ] }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAkEAAAGFCAYAAAD6lSeSAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAIABJREFUeJzt3Xl8TPf+x/H3ZCIiJJFIKKraey9K\n5aKh1lBLJNaLaqtEFb+2v5ZWSVHU0lqK1hKutK7bn7bUvWqPUqFa60UR1dLSjVpK9oXEEpP5/dF2\nrqklmExmTub17COPa86ZOfOZPG7k7fP9nHNMVqvVKgAAAA/j5eoCAAAAXIEQBAAAPBIhCAAAeCRC\nEAAA8EiEIAAA4JEIQQAAwCN5u7oAZzNF3u3qEgCPlLfhqKtLADxWGXPZYn0/R3/XWjedKqJKbk+J\nD0EAAMDJTCZXV3BHWA4DAAAeiU4QAABwjEFbKoQgAADgGIMuhxGCAACAY4yZgYzawAIAAHAMnSAA\nAOAYlsMAAIBHMui6EiEIAAA4hk4QAADwSMbMQEZtYAEAADiGThAAAHCMlzFbQYQgAADgGGNmIEIQ\nAABwEIPRAADAIxkzAzEYDQAAPBOdIAAA4BgGowEAgEcyZgYiBAEAAAcZdDCamSAAAOCR6AQBAADH\nMBMEAAA8kjEzECEIAAA4yKAzQYQgAADgGGNmIAajAQCAZ6ITBAAAHMNgNAAA8EjGzECEIAAA4CAG\nowEAgEcy6ISxQcsGAABwDJ0gAADgGJbDAACAR3JyBpo+fbr279+vK1eu6Nlnn1VYWJhGjBghi8Wi\n0NBQvfnmm/Lx8VFCQoLef/99eXl56fHHH1fPnj1velxCEAAAcIwTO0G7d+/W999/r6VLlyozM1Pd\nu3dX06ZN1bt3b3Xo0EHTp0/X8uXL1a1bN82bN0/Lly9XqVKl1K1bN7Vr107ly5e/4bGZCQIAAG6r\nUaNGiouLkyQFBgbqwoUL2rNnj9q2bStJatu2rXbt2qWDBw8qLCxM/v7+8vX1VcOGDZWUlHTTYxOC\nAACAY7wc/LoJs9ksPz8/SdKyZcvUsmVLXbhwQT4+PpKk0NBQpaamKi0tTcHBwbbXhYSEKDU1tdCy\nAQAA7pzJ5NjXLfj000+1fPlyjRs3TqarXmO1Wu3+9+rtpkKOTQgCAACOMTn4VYjt27frnXfe0YIF\nC+Tv768yZcro4sWLkqTk5GRVrFhRlSpVUlpamu01KSkpCg0NvelxCUEAAMAxXibHvm7i3Llzmj59\nuubPn28bcm7WrJkSExMlSRs3blRERITq1aunr7/+Wjk5OcrNzVVSUpIaNmx402NzdhgAAHBb69ev\nV2Zmpl566SXbtqlTp+rVV1/V0qVLVaVKFXXr1k2lSpVSbGysBg4cKJPJpEGDBsnf3/+mxzZZ/7iI\nVsKYIu92dQmAR8rbcNTVJQAeq4y5bLG+n+mFMIdeb537dRFVcnvoBAEAAMcY84LRhCAAAOCYws7C\ncleEIAAA4BCjhiDODgMAAB6JThAAAHCIQRtBhCAAAOAYL4OmIEIQAABwCDNBAAAABkInCAAAOMSo\nnSBCEAAAcAghCAAAeCSDZiBCEAAAcIxRO0EMRgMAAI9EJwgAADjEqJ0gQhAAAHCIyaC3kScEAQAA\nh9AJAgAAHsmgGYjBaAAA4JnoBAEAAIdwA1UAAOC
RmAkCAAAeyaghiJkgAADgkegEAQAAhxi0EUQI\nAgAAjjHqchghCAAAOIQQBAAAPJJRQxCD0QAAwCPRCQIAAA4xaieIEAQAABxi0AxECAIAAI6hEwQA\nADySUUMQg9EAAMAj0QkCAAAOMepd5OkEoViV8i6lcTFDdXThNp1P+E6HFmzWc12etO2vWD5EC4fP\n1Mkle5W9+lvtmL1Kres3u+HxYns+K+umU+rX/tHiKB/wGOfPn1fkw1Hq0K6Tq0uBAZhMjn25Cp0g\nFKvZz01Qr4f/pmfjRirp+0Pq3KSd/j54ki5evqTFm1dqwxuLJUm9Jj+vtJwMvdJrkNZP/kANnovW\nkRM/2B2rWmgVjen9gis+BlDizYuLV2ZGpkIrhrq6FBgAM0FAIQL8/PV0x956ffEsLd+2Tj+d+Vlz\nVr2rTUnbFNO2hyIfjFCDv9RV/7eGaefhvTp68kc9M2uk8q9c0aMtO19zvLmDJ2rZtnUu+CRAyXb4\n0DdatWK1ojtFu7oUwKnoBKHY5OSdU5Ve4cq9mGe3PTkzTfX//IAS923V3U800um0M7Z9+VfylXk+\nWyEBwXav+VuzKLWo+5Bq9W+pZzr1KZb6AU9gsVg0+bXJerJ/X5lMJu3fu9/VJcEATKITBBQqLTtD\nFy5dtD0uU9pXbeo3154jB2QpsNgFIEn6659q656KVbXnyAHbtrK+fpoz6HUN/8ckpedkFlvtgCf4\n95KlOn8+VwOfGeDqUmAgJpPJoS9XcftOkMViUWZmpiwWi932SpUquagiFKV5L0xW+XIBmvrvedfs\nK1emrBaNnKOvjx3R0i0Jtu2v93tZx8+e1MLEpcVZKlDiJSenKH7O25oR95ZKly7t6nJgIEadCXLr\nELRw4ULNnj1bly9fltVqtW03mUz69ttvXVgZikL8i1MU07aHHpv0nH4687PdvsCyAdrwxmKFBgar\nZWxPWQp+DcH1/lxH/9u5rxoO6uiKkoESbfqU6WrVuqWaNGvs6lJgMAbNQO4fgj788EPVqVNHXl6s\n3JUUXl5eWvjyTD3asrN6vv6sEnZttNsfEhisTdP+pfJlA9Qqtqd+OH1M0q/hd/6QaXpr+Xx9e+J7\nV5QOlFjbtm5X0r4krUhY7upSgGLj1iEoJCREdevWdXUZKGJ/HzxJ3ZpFKWpUH23/eo/dPj/fMvpk\nyiL5eJdSs5e66Ux6sm1ftdAqaly7gcJrhmn0E4PtXvfusLf0z2FvqlT0vcXxEYAS59PET5WVla3I\nh6Ns2woKCmS1WhUe1kjPPPe0nn3+GRdWCHfGclgRSk7+9Rdfr169NGvWLHXt2lXlypWzew4zQcb0\ndMc+GhD1uKJGxVwTgCRpwdA3FRIQrCYvdlVyZqrdvl/Sk1X36bbXvObQgs0a98FbWvOfjdfsA3Br\nBg0ZpCf797XbtvRfy7Tlsy16e8E8BQcH3+CVACGoSLVq1Uomk8k2BzR//ny7/cwEGVNZXz9NHThK\n7274t46c/EGVguwvwvbnKtXVu003xUx9UZLs9l++kq/Mc1k6fPzodY99Ou3sDfcBKFylShVVqVJF\nu23BwUHy9vbWX2r8xUVVwSgIQUXoyJEjri4BThBe868KDiiv57v20/Nd+12zf/z7MyRJi1+Zc82+\nLQd3qfXL3BoDAFB0TNarT7tyMzk5OYqLi9OoUaPk7e2t5ORkzZs3T7GxsQoMDLylY5gi73ZylQCu\nJ28DnTnAVcqYyxbr+9Wa5djVxY8O3VBEldwetz7lasSIEfL2/m+zKjAwUIGBgRo5cqQLqwIAAFcz\n6sUS3ToEHT9+3NYFkiRfX1/Fxsbq+PHjri0MAADYEIKcwNvbWz/++KPdtkOHDrmoGgAAcD1GDUFu\nORj9u5EjR6p3796qUqWK/P39lZmZqbS0NM2Zc+3gLAAAwO1w6xAUERGhLVu2KCkpSZmZmQoKClJ4\neDj3tAEAwI0
URzPnu+++0/PPP6+nnnpKMTExys/P1yuvvKKff/5ZZcuW1Zw5cxQYGKiEhAS9//77\n8vLy0uOPP66ePXve8JhuvRwWExOjMmXKqHnz5urcubOaN28uX19ftWzZ0tWlAQCA3zh7OSwvL08T\nJ05U06ZNbds++ugjBQUFafny5erYsaP27dunvLw8zZs3T++9954WLVqkf/7zn8rKyrrhcd2yE7R6\n9WqtWbNGhw8f1oABA+z2nT9/nvuIAQDgRpw91+Pj46MFCxZowYIFtm2ff/65Xnzx14vrPv7445Kk\nXbt2KSwsTP7+/pKkhg0bKikpSW3atLnucd0yBHXs2FH33nuvBg8erC5dutjt8/b2Vnh4uIsqAwAA\nf+TsEOTt7W13yRxJOn36tPbu3au4uDgFBARo/PjxSktLs7vFS0hIiFJTU/94uP8e12kVO8DHx0f1\n69fXmjVrVKFChWv2T548WWPGjHFBZQAAwB1YrVZVrlxZ7777ruLj4zV//nzVrl37mufcLKC5ZQj6\n3eXLlzV27FidPHlSBQUFkqTc3FwlJycTggAAcBOuOMs9JCREDRs2lCS1aNFCc+fO1cMPP6wtW7bY\nnpOSkqL69evf8BhuPVwzYsQIWSwWde3aVceOHVOXLl0UEBCg+Ph4V5cGAAB+44rrBLVs2VLbt2+X\nJB0+fFj33Xef6tWrp6+//lo5OTnKzc1VUlKSLShdj1t3glJSUrRo0SJJ0oIFC/Too4+qXbt2evnl\nl/Xuu++6uDoAACDJ6a2gQ4cOadq0aTp9+rS8vb2VmJiot956S9OmTdPq1avl4+OjadOm2e4sMXDg\nQJlMJg0aNMg2JH09bh2CzGazUlJSVLFiRXl5eSk7O1tBQUE6duyYq0sDAADFpG7duramyNVmzpx5\nzbbo6GhFR9/aDV3dOgT1799fkZGR2r9/v9q0aaM+ffqoatWqCgkJcXVpAADgN6689YUj3DoEPfro\no2rbtq28vb01dOhQ1apVS+np6ercubOrSwMAAL8xaAZy7xAk/Xo22LFjx1RQUKBKlSqpUqVK+umn\nn6576jwAACh+dIKcYPTo0fr4448VGhoqs9ls224ymZSYmOjCygAAwO8IQU6wY8cObdu2TeXLl3d1\nKQAAoIRx6xBUu3Ztw6ZLAAA8hVF/V7t1CHr++efVvXt3hYWFyc/Pz27fG2+84aKqAADA1Qyagdw7\nBI0cOVJ169ZVjRo17GaCAACA+6AT5CRz5sxxdQkAAOAmjBqC3PreYT179lRCQoIuX77s6lIAAEAJ\n49adoPfee09ZWVkaOXKkbTnMarXKZDLp0KFDLq4OAABIxu0EuXUIWrp0qatLAAAAhSAEOUHVqlVt\nf/7www/Vp08fF1YDAACux6AZyL1ngq724YcfuroEAABQgrh1J+hqVqvV1SUAAIDrYDnMyWJjY11d\nAgAAuA5CkJMcOHBAZ86ckcVi0dq1a23bu3Tp4sKqAADA7whBThAbG6vdu3fr3nvvlZfXf8eXTCYT\nIQgAADdh0Azk3iFo7969+vTTT1WmTBlXlwIAAEoYtw5Bd999N/cMAwDAzbEc5gTt27fX008/raio\nKPn7+9vtYzkMAAA3QQgqeps3b5YkffLJJ3bbmQkCAMB90AlygkWLFrm6BAAAUAgvY2Yg9w5BFy9e\n1HvvvaedO3cqPT1dFSpUUOvWrRUTEyMfHx9XlwcAAAzMrUPQa6+9ppycHD311FMKDAxUVlaWli1b\nppMnT2r8+PGuLg8AAIjlMKc4ePCg1q9fb7ft4YcfVteuXV1UEQAA+CMvQlDRs1qtunTpkkqXLm3b\nduXKFRdWBAAA/ohOkBO0b99eTzzxhLp3766AgABlZWUpISFB0dHRri4NAAAYnFuHoJdeekk1a9bU\n1q1blZGRoZCQEP3P//yPOnTo4OrSAADAb7wKf4pbcusQZDKZ1KlTJ3Xq1MnVp
QAAgBtgJqgI9e3b\n96briyaTSe+//34xVgQAAG6EmaAi9MILL1x3e2pqqubOnav8/PxirggAANwInaAi9NBDD9k9vnz5\nshYuXKhFixapb9++6t+/v4sqAwAAJYVbhqCrbdy4UW+++aYaNWqkVatWKTQ01NUlAQCAq7AcVsSO\nHDmiyZMnS5Li4uJUp04dF1cEAACuh7PDitDYsWP1xRdfaNiwYYqKinJ1OQAA4CaYCSpCy5YtkyQN\nGTLkmhab1WqVyWTSt99+64rSAADAH7AcVoSOHDni6hIAAEAJd8MQNHbs2EJfPHHixCItBgAAGE+J\nWw6rVKlScdYBAAAMypgR6CYhaPDgwXaPz549q4yMDM7SAgAAdozaCSr0rLZTp07pkUceUZcuXfTM\nM89IkkaMGKEtW7Y4uzYAAACnKTQEvfzyyxo4cKD27t0rf39/Sb/e1mL27NlOLw4AALg/L5PJoS9X\nKfTssIyMDHXs2FHSf0+Bq1atGvfvAgAAkox7inyhnaCAgADt2rXLbttXX30lPz8/pxUFAACMo8R2\ngkaNGqUhQ4bI399fZ8+eVc+ePZWamqo5c+YUR30AAMDNGbMPdAshKDw8XBs3btTevXt17tw5VaxY\nUfXq1VPp0qWLoz4AAACnKDQEWa1W7d69W19++aWys7NVvnx5Xbp0SREREcVRHwAAcHMl9hT5cePG\nadq0abp06ZJCQ0OVl5enSZMmcbVoAAAgqQTPBO3cuVPr16+Xr6+vbdvQoUPVuXPnW7q1BgAAKNlK\n7NlhFSpUkJeX/dO8vb1VsWJFpxUFAACMozg6Qd99953atWunxYsXS5LOnDmjp556SjExMXrqqaeU\nmpoqSUpISNAjjzyiRx99VMuXL7/pMW/YCVq7dq0kqXHjxoqJiVFUVJSCg4OVnZ2tjRs3qkWLFrdU\nNAAAgCPy8vI0ceJENW3a1LZt9uzZeuyxx9SxY0d9+OGHWrhwoQYPHqx58+Zp+fLlKlWqlLp166Z2\n7dqpfPny1z3uDUPQRx99ZPtz6dKl7W6TYTab9cUXXxTBxwIAAEbn7MUwHx8fLViwQAsWLLBtGz9+\nvO1M9aCgIB0+fFgHDx5UWFiY7Q4XDRs2VFJSktq0aXPd494wBC1atOimBSUmJt72hwAAACWPs4eb\nvb295e1tH1l+v2izxWLRkiVLNGjQIKWlpSk4ONj2nJCQENsy2XWPW9gbWywWrV+/XidPnlRBQYEk\nKTc3V0uXLlVUVNQdfRgAAFByuOoML4vFohEjRqhJkyZq2rSpEhIS7PZbrdabDm0XOhg9atQovfPO\nOzpx4oQ++OADHT9+XImJiZo2bZrj1QMAANyhUaNGqXr16ho8eLAkqVKlSkpLS7PtT0lJUWho6A1f\nX2gISkpK0qpVqzR16lRVqFBBb731luLj47V9+/YiKB8AABidyWRy6OtOJCQkqFSpUnrxxRdt2+rV\nq6evv/5aOTk5ys3NVVJSkho2bHjDYxS6HHb1OlxBQYGuXLmi+++/X3v27LmjogEAQMlSaEfFQYcO\nHdK0adN0+vRpeXt7KzExUenp6SpdurT69u0rSfrzn/+sCRMmKDY2VgMHDpTJZNKgQYNsQ9LXU2gI\natq0qbp3764VK1bogQce0JgxY1SjRg3l5+cX3acDAACG5eyLJdatW7fQE7Z+Fx0drejo6Ft67i3d\nNmPQoEHy9vbWq6++Kh8fHx04cEDTp0+/pTcAAAAlW4m9bYbJZFL79u0lScHBwdwzDAAAlAg3DEEP\nPPBAoe2tQ4cOFXlBAADAWIx6F/kbhqCNGzcWZx0AAMCgjHoD1RuGoKpVqxZnHU5zYcN3ri4B8Ejn\n8rNdXQLgscqYyxbr+3k5/cYZzlHoTBAAAMDNGLUT5OxT+wEAANzSLXeCzp49q4yMDNWpU8eZ9QAA\nAIMx6mB0oZ2gkydP6pFHHlGXLl30zDPPS
JJGjBihzz//3OnFAQAA92dy8D9XKTQEDR8+XAMHDtTe\nvXttl55+4YUXFBcX5/TiAACA+3PFvcOKQqEhKCMjQx07dpT038GnatWqcdsMAABgaIWGoICAAO3a\ntctu21dffSU/Pz+nFQUAAIyjxN42Y9SoURoyZIj8/f119uxZ9ezZU6mpqZozZ05x1AcAANycyaAn\nmxcagsLDw7Vx40bt3btX586dU8WKFVWvXj2VLl26OOoDAABuzqhnhxUagtauXWv3ODk52XZLjS5d\nujinKgAAYBhGvVhioSHoo48+snucnZ2tEydOqHnz5oQgAABgWIWGoEWLFl2z7cCBA0pISHBKQQAA\nwFhcea0fR9zRJFODBg30xRdfFHUtAADAgErs2WF/nAmyWCw6evSoLBaL04oCAADG4TEzQWazWaGh\noZo5c6bTigIAAHC2QkNQbGys6tevXxy1AAAAA/Iy6HWCCq16zJgxxVEHAAAwKKPeO6zQTlC7du30\n9NNPq1WrVgoMDLTbxynyAACgxM4EJSUlSZISExPttptMJkIQAACQl0FPkb9hCMrLy5Ofn991rxME\nAABgdDecCerZs2dx1gEAAAyqxM0EWa3W4qwDAAAYVIm7geqlS5d04MCBm4ahBx980ClFAQAA4zDq\nbTNuGIJSUlL08ssv3zAEmUwmbd682WmFAQAAY/AyGfM6QTcMQdWqVdMnn3xSnLUAAAAUm0JPkQcA\nALiZEnedoMaNGxdnHQAAwKBK3EzQhAkTirEMAABgVEY9O8yYk0wAAAAOYiYIAAA4pMQthwEAANwK\noy6HEYIAAIBDTCXtOkEAAAC3wqjLYcaMbgAAAA6iEwQAABzCTBAAAPBIJe6K0QAAALfCy6AzQYQg\nAADgEKN2ghiMBgAAHolOEAAAcAjXCQIAAB6JmSAAAOCRmAkCAAAwEDpBAADAIUa9bQYhCAAAOMSo\ny2GEIAAA4BBnDkbn5uZq5MiRys7OVn5+vgYNGqTQ0FBNmDBBklSrVi299tprd3RsQhAAAHCIM0+R\nX7Vqle677z7FxsYqOTlZ/fr1U2hoqEaPHq2//vWvGjJkiLZu3apWrVrd9rEZjAYAAG4rKChIWVlZ\nkqScnByVL19ep0+f1l//+ldJUtu2bbVr1647OjYhCAAAOMTk4H8306lTJ/3yyy+KjIxUTEyMRowY\noYCAANv+0NBQpaam3lHdLIcBAACHOHMwes2aNapSpYreffddHTlyRC+++KL8/Pxs+61W6x0fmxAE\nAAAc4sxT5JOSktSiRQtJ0v3336+8vDzl5eXZ9icnJ6tixYp3dGyWwwAAgENMJpNDXzdTvXp1HTx4\nUJJ0+vRplS1bVjVr1tS+ffskSRs3blRERMQd1U0nCAAAuK3HH39co0ePVkxMjK5cuaIJEyYoNDRU\n48aNU0FBgerVq6dmzZrd0bFNVkcW0wzgoiWv8CcBKHLn8rNdXQLgsUJ9Kxfr+6069m+HXt/9vl5F\nVMntoRMEAAAcwhWjAQCARzIZdMTYmFUDAAA4iE4QAABwCMthAADAIznzOkHORAgCAAAO8aITBAAA\nPJFRO0EMRgMAAI9EJwgAADiEwWgAAOCRjHqdIEIQAABwCJ0gAADgkbwYjAYAADAOOkEAAMAhLIcB\nAACPZNTrBBGCAACAQ4zaCWImCAAAeCQ6QQAAwCFcJwgAAHgkbqAKFJGcnHOKnxuvzzZ/rvS0dN1V\n+S716v24+vTtLS8vY/5rA3BH+fn5WvTuh9q47lOlpaTprip3qUevburxeDdJ0vGfftb8OQv09Zdf\nKy83T3+q8ScNfK6/mkY0cXHlcDcMRgNFZETsSP1y+hdNnPyaqt5dVdu37dC0KdNVUFCgfv2fdHV5\nQIkRN/3v2rzhMw0fG6tatWto57ZdmvVGnHx8fBTRurmGPDNMte6voVnvvCWf0j5a8t5SjXxxtN75\nYJ7qh
NV2dflwIwxGA0Ug+WyyDn19SCNeeVmNmzbW3dXu1hN9eqlx08b6dOOnri4PKDHOnzuvtSs/\nVv9nn1Sb9g+rarWqeqxPTzVq0lAb123Svt37deniRY1741XVuL+Gqt9XXcNfHSYfHx9t+2y7q8sH\nigSdILiVSndV0o7d2667z2zm/65AUSlbrqxWb1qhMmV87bYHVQjSD0d/UNvoNmob3ea6rzWbzcVR\nIgzEqMthdILg1vLz87V65Rod2H9A/fr3dXU5QIlhMpkUFFxevleFoIsXLirpi6TrLnWdyzmnv898\nW75lSqtz947FWSoMwGQyOfTlKm7/T2uLxaLMzExZLBa77ZUqVXJRRSguT/bup6+/OqTyQeU17a03\n1Lpta1eXBJRoM6fM1vlz5xUzoLdt2/lz59WtXU9dvHhRNe+vobnvxqly1courBLuyMugPRWT1Wq1\nurqIG1m4cKFmz56ty5cv6+oyTSaTvv3221s6xkVLnrPKg5OdPXNWmZlZ2vr5Vr274P80YeJ4derC\nv0CN4lx+tqtLwC2yWq2aMXmW1q5ap9enT1CrthG2fQUFBfrl1C9KT8vQyn+v0oF9X2rW/Bn6c40/\nubBiFCbUt3iD6u6UrQ69vknFVkVUye1x6xDUsmVLxcfHq06dOnd8ajQhqGSYMX2mVq9co63/+ZzT\n5A2CEGQMFotFU8ZN0+ebtui1aeMU0brFDZ9rtVr1TJ/nVCG0gqbGTS7GKnG7CEG3xq1/m4SEhKhu\n3br80vMgv5z+RevWrteVK1fstv+lxl+Uk5OjrKwsF1UGlEyz3ojT9s93aObbb9oFoCPfHNV/tu2y\ne67JZNK9f66uUydOF3eZcHMmB/9zFbdMF8nJyUpOTlavXr00a9Ys/fjjj7Ztv3+hZDrx8wmNHjlG\n+/cl2W3//rvvVbZsWQUHB7uoMqDkWbN8rdat/kRT4yarfng9u33bPtuu8SNfV+75XLvtP/1wTFWr\nVSnOMmEADEYXoVatWslkMtnmgObPn2+3/3ZmgmAsjRo30gN162ji+Eka9epIVb+3ur74Yq+WLV2u\nPn17F34AALckLy9P78T9Q526d1T1++5Relq63f5uj3bV8iUrNfblCXp68ECVLeenhBUf6+g33+nZ\nt592UdVwV0Y9Rd6tZ4KKAjNBxpOVlaW5s/+uzzZ/rtzzuapatYq6dOuivv1iVKpUKVeXh1vETJB7\nO7DvS70w8KUb7t9xcIt++uGY3p49XweTvpIkVb+vuvoO7K2WbSJu+Dq4h+KeCdqbusOh1zcKvfEs\nmjO5dQjKyclRXFycRo0aJW9vbyUnJ2vevHmKjY1VYGDgLR2DEAS4BiEIcJ3iDkH7Unc69PqGoc2L\nqJLb45YzQb8bMWKEvL3/u2IXGBiowMBAjRw50oVVAQAAOyaTY18u4tYh6Pjx47YukCT5+voqNjZW\nx48fd21hAADAhrPDnMDb21s//vij3bZDhw65qBoAAHA9nB3mBCNHjlTv3r1VpUoV+fv7KzMzU2lp\naZozZ46rSwMAAAbn1iEoIiJCW7ZsUVJSkjIzMxUUFKTw8HCVLl3a1aUBAIDfGPUUebcOQTExMVq8\neLGaN7efGo+IiND27dtdVBVft7MyAAAQsElEQVQAALgaIagIrV69WmvWrNHhw4c1YMAAu33nz5/n\nNhoAALgRV871OMItQ1DHjh117733avDgwerSpYvdPm9vb4WHh7uoMgAA8Ed0goqQj4+P6tevrzVr\n1qhChQrX7J88ebLGjBnjgsoAAEBJ4ZYh6HeXL1/W2LFjdfLkSRUUFEiScnNzlZycTAgCAMBNGLUT\n5NbDNSNGjJDFYlHXrl117NgxdenSRQEBAYqPj3d1aQAA4DdGvU6QW4eglJQUTZkyRT169FC5cuX0\n6KOPaubMmYqLi3N1aQAA4DdcMdoJzGazUlJSJEleXl7Kzs5WUFCQjh0
75uLKAACA0bn1TFD//v0V\nGRmp/fv3q02bNurTp4+qVq2qkJAQV5cGAAB+Y9RT5E1Wq9Xq6iJuJiMjQ8HBwSooKNC6deuUkZGh\nzp07X/esseu5aMlzcoUArudcfrarSwA8Vqhv5WJ9v2+zDjr0+trl6xVRJbfHrTtBkhQcHCzp1+Ww\nP14zCAAAuJ5Rzw5zyxB0//3327XWfm9WmUwmWa1WmUwmffvtt64qDwAAXMWoy2FuGYJ69Oihw4cP\nKywsTB06dFCTJk1kNptdXRYAAChB3HYmKD8/Xzt27NDHH3+sgwcPqnnz5urUqZMeeuih2zoOM0GA\nazATBLhOcc8EfZd9yKHX1wysW+hzLl68qE6dOmnQoEFq2rSp7VqCoaGhevPNN+Xj43Pb7+u2p8iX\nKlVKrVu31owZM7Ru3To1adJES5YsUadOnTRp0iRXlwcAAH5THNcJevvtt1W+fHlJ0pw5c9S7d28t\nWbJEVatW1fLly++obrcNQVfLzc1VRkaG0tPTZbFY5Ofn5+qSAADAb5x9xegff/xRP/zwgx5++GFJ\n0p49e9S2bVtJUtu2bbVr1647qtstZ4KkX4PPpk2btHbtWv3www+KjIzUsGHD1KBBA1eXBgAAitG0\nadM0duxYrV69WpJ04cIF2/JXaGioUlNT7+i4bhmChgwZoqNHj6px48Z6+umn1bhxY8NOngMAUPI5\n73f06tWrVb9+fVWrVu2/73adM8jvhFuGoMTEREnS8ePHtXTp0ms+LKfIAwDgPpzZqNiyZYtOnjyp\nLVu26OzZs/Lx8VGZMmV08eJF+fr6Kjk5WRUrVryjY7tlCDpy5IirSwAAALfImRdLnD17tu3Pc+fO\nVdWqVXXgwAElJibqb3/7mzZu3KiIiIg7OrYhBqMBAID7Ku67yL/wwgtavXq1evfuraysLHXr1u3O\n6nbX6wQVFa4TBLgG1wkCXKe4rxN07Nx3Dr3+Pv+aRVTJ7XHL5TAAAGAcRj15iRAEAAAcwg1UAQCA\nRyIEAQAAj2TU5TDODgMAAB6JThAAAHAIy2EAAMAjGXU5jBAEAAAcYtROEDNBAADAI9EJAgAADjJm\nJ4gQBAAAHGLMCEQIAgAADmIwGgAAeChjhiAGowEAgEeiEwQAABxizD4QIQgAADjMmDGIEAQAABxi\n1MFoZoIAAIBHIgQBAACPxHIYAABwiFHvHUYIAgAADjFqCGI5DAAAeCRCEAAA8EgshwEAAIdwijwA\nAICB0AkCAAAOMepgNCEIAAA4yJghiOUwAADgkegEAQAAhxizD0QIAgAADjLq2WGEIAAA4CBCEAAA\n8EDGjEAMRgMAAA9FJwgAADjImL0gQhAAAHCIUQejWQ4DAAAeiRAEAAA8EsthAADAIdw7DAAAeChC\nEAAA8EDGjECEIAAA4CDODgMAADAQOkEAAMBBxuwEEYIAAIBDjBmBCEEAAMBhxoxBhCAAAOAQBqMB\nAAAMhBAEAAA8EsthAADAIUa9bYbJarVaXV0EAABAcWM5DAAAeCRCEAAA8EiEIAAA4JEIQQAAwCMR\nggAAgEciBAEAAI9ECMItq1WrlsaMGWO3bc+ePerbt6+LKrq+y5cva/Xq1YU+r1atWjp79mwxVAS4\nRq1atfTiiy9es3306NGqVatWoa/nZwklHSEIt+WLL77QN9984+oybuqbb765pb+4AU9w9OhRnT9/\n3vb48uXLOnTo0C29lp8llHSEINyWYcOGacqUKdfdV1BQoFmzZik6OlrR0dF65ZVXlJeXJ0nq27ev\nFi5cqCeeeEIREREaNmyYrnedzlOnTqlFixZasGCBoqKiFBUVpS+//FLPPPOMIiIiNGrUKNtzly1b\npg4dOqh9+/bq06ePTp8+rbS0NA0ePFhffvmlevfuLUnavn27OnXqpKioKD377LPKysqyHWPr1q3q\n0aOHWrRoof/7v/8rym8V4BYaN26
sTZs22R7v2LFDYWFhds/hZwmeihCE29KhQwdZrVZt2LDhmn2f\nfPKJtm3bppUrV2r9+vXKycnRe++9Z9v/2WefaeHChUpMTNTu3buVlJR03ffIzMxUaGioEhMTVatW\nLQ0dOlRTp05VQkKCPv74Y504cULp6el6/fXXtXDhQm3cuFH33HOP4uPjFRISomHDhql+/fpasmSJ\n8vLyFBsbq1mzZikxMVH33HOP4uLibO91+vRprVy5Um+//bZmz56t/Pz8Iv+eAa7UoUMHffzxx7bH\n69atU3R0tO0xP0vwZIQg3LbRo0frrbfe0qVLl+y2b9myRd26dZOfn5+8vLzUo0cP7dy507Y/Ojpa\nvr6+8vPz07333qszZ85c9/hXrlyx/SVds2ZNhYWFKTg4WEFBQQoNDVVKSooqVKig/fv366677pIk\nNWzYUCdPnrzmWElJSapcubJq1qwpSRo+fLhdN6lr166SpDp16ujSpUvKzMx04DsDuJ+HHnpI33//\nvdLT03Xx4kUdOHBATZs2te3nZwmejBuo4rY98MADatSokRYuXKgGDRrYtmdkZCgwMND2ODAwUOnp\n6bbH5cqVs/3ZbDbLYrFo06ZNmjFjhiQpJiZGDz/8sMxms3x9fSVJXl5e8vPzu+Z1FotFc+fO1ebN\nm2WxWJSbm6v77rvvmlozMzMVEBBge+zj42O3//eazGazpF+X9ICSxGw2q3379vrkk08UHBysFi1a\nyNv7v3/187MET0YIwh0ZOnSoevToobvvvtu2LSQkxG5GICsrSyEhITc9TmRkpCIjI22PT506dUvv\nv379em3evFmLFy9WcHCwPvroI61du/aa5wUFBdn9i/TChQvKzs62/asX8AQdO3bUrFmzFBQUZJvv\n+R0/S/BkLIfhjlSsWFF9+vTR3LlzbdtatWqlhIQEXbhwQVeuXNGyZcvUqlUrp7x/enq6qlatavuL\nef369crNzZUkeXt76/z587JarQoPD1dqaqq++uorSVJ8fLzmzZvnlJoAd9WgQQOlpKTo+++/10MP\nPWS3j58leDJCEO7YgAED7IYfO3TooJYtW6pHjx7q3LmzKleurCeffNIp7925c2dlZWWpdevWio2N\n1dChQ3X27FlNmjRJ4eHhSklJUUREhHx8fDR37lwNHz5cUVFROnr0qIYOHeqUmgB3ZTKZFBkZqWbN\nmsnLy/6vfX6W4MlM1uudpwwAAFDC0QkCAAAeiRAEAAA8EiEIAAB4JEIQAADwSIQgAADgkQhBAADA\nIxGCAACARyIEAQAAj0QIAgAAHokQBAAAPBIhCAAAeCRCEAAA8EiEIAAA4JEIQQAAwCMRggAAgEci\nBAEAAI9ECAIAAB6JEAQAADwSIQgAAHgkQhAAAPBIhCDA4GrVqqXIyEhFR0crKipKjzzyiHbt2uXw\ncePj4/XKK69Ikvr166fDhw/f9PkfffTRbb/Hvn371KZNm2u279mzR5GRkYW+vk2bNtq3b99tvecr\nr7yi+Pj423oNgJLJ29UFAHDcokWLdNddd0mS9u/fr+eee04bNmxQcHBwkRz//fffv+l+i8Wi6dOn\n67HHHiuS9wOA4kAnCChhwsPDdc899+jAgQM6deqUWrRooSlTpigmJkbSryHpkUceUWRkpB577DGd\nPHlSknTx4kW99NJLat26tWJiYnT27FnbMa/uuKxatUpRUVGKiorS8OHDdfnyZfXv31/nzp1TdHS0\nTp48qeTkZP3v//6v7Xlbt261HSs+Pl6tWrVS9+7d9Z///KfQz3PhwgW99NJLioqKUps2bTRt2jS7\n/bt371a3bt3UqlUrzZo1y7Z98+bN6tKli9q2basBAwYoIyPjzr+pAEokQhBQAl25ckU+Pj6SpKys\nLNWuXVuLFy9Wbm6uhgwZomHDhmnTpk168sknNWTIEEnSihUrlJaWpk2bNmnu3LnasWPHNcc9deqU\
npk+frg8++EAbNmzQhQsX9MEHH2jKlCkym83asGGDqlWrpnHjxun+++9XYmKi/vGPf2jEiBHKzMzU\nDz/8oPfee08rVqzQ8uXLdfTo0UI/y7/+9S/l5uZqw4YNWrVqlVauXGm3BHb48GGtWLFCK1eu1L/+\n9S8dOXJEZ86c0ahRozRjxgxt3rxZjRs31oQJE4rmmwugxCAEASXM1q1blZaWpgcffFCSlJ+fb5uv\n2bdvn8qWLavmzZtLkjp37qwTJ07ol19+0b59+xQZGSlvb28FBQWpdevW1xx7586datCggSpVqiST\nyaQZM2boqaeesntOXl6etm7dqt69e0uSqlevrvDwcG3dulV79+5Vo0aNFBISIrPZrK5duxb6eQYM\nGKD4+HiZTCYFBgaqRo0aOnXqlG1/ly5dZDabVaFCBTVq1EgHDhzQZ599prCwMNWsWVOS9MQTT+iz\nzz6TxWK5/W8ogBKLmSCgBOjbt6/MZrOsVquqVq2qBQsWqGzZssrMzJTZbFa5cuUkSTk5OUpOTlZ0\ndLTttT4+PsrIyFB2drb8/f1t2wMCApSbm2v3PpmZmQoICLA9Ll269DW1nDt3TlarVU8++aRtW15e\nnpo0aaK8vLxr3qMwx48f19SpU/XTTz/Jy8tLZ8+eVY8ePWz7r5578vf3V05OjqxWqw4ePGj3OcuV\nK6esrKxC3w+A5yAEASXA1YPRN1OxYkX96U9/0sqVK6/ZFxAQoHPnztkeX2+GJigoSAcOHLA9Pn/+\nvC5evGj3nAoVKshsNmvFihUqW7as3b4lS5bYvUdmZmahNb/++ut64IEHNG/ePJnNZvXq1ctuf3Z2\ntt2fAwMD5ePjo2bNmmnOnDmFHh+A52I5DPAg9erVU2pqqg4ePChJOnnypIYPHy6r1ar69evblowy\nMjK0bdu2a17fqlUrJSUl6dSpU7JarRo/fryWL1+uUqVKqaCgQOfPn5e3t7datmypf//735J+HWwe\nNWqUzpw5owcffFD79+9XRkaGLBaLEhISCq05PT1dtWvXltls1s6dO/Xzzz/bdajWrVungoICpaen\na//+/QoPD1fz5s21b98+29D3V199pUmTJhXFtxBACUInCPAgvr6+mjNnjiZOnKjc3FyVKlVKQ4YM\nkclk0mOPPaZ9+/apXbt2qlKlitq1a2fXtZGku+66S6+//rr69esns9mssLAw9e/fX6VKlVJ4eLha\nt26t+fPn67XXXtP48eO1bNkySVLXrl1VuXJlVa5cWb169VL37t1Vvnx5derUSd99991Na37uuec0\nadIk/f3vf1dkZKQGDx6smTNnqk6dOpKksLAw9ezZUxkZGerXr59q1KghSZo4caIGDRqk/Px8lS1b\nVqNHj3bCdxSAkZmsVqvV1UUAAAAUN5bDAACARyIEAQAAj0QIAgAAHokQBAAAPBIhCAAAeCRCEAAA\n8EiEIAAA4JEIQQAAwCMRggAAgEf6f2pJ2ZjZ//iCAAAAAElFTkSuQmCC\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "print('Precision score: {}.'.format(np.round(precision_score(y_test, y_pred), 4)))\n", "print('Recall score: {}.'.format(np.round(recall_score(y_test, y_pred), 4)))\n", "print('F1 score: {}.'.format(np.round(f1_score(y_test, y_pred), 4)))\n", "\n", "f,ax = plt.subplots(1, figsize=(10,6))\n", "f.set_tight_layout(False)\n", "\n", "fontsize=12\n", "sns.heatmap(confusion_matrix(y_test, y_pred), \n", " ax=ax, \n", " annot=True, \n", " annot_kws={'fontsize': 16}, \n", " cmap='Greens', \n", " fmt='g')\n", "ax.set_yticklabels(['Match', 'Non-match'], fontsize=fontsize)\n", "ax.set_xticklabels(['Non-match', 'Match'], fontsize=fontsize);\n", "ax.set_ylabel('True label', fontsize=fontsize)\n", "ax.set_xlabel('Predicted label', fontsize=fontsize)\n", "ax.xaxis.labelpad = 18\n", "ax.yaxis.labelpad = 18\n", "plt.show();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Overall, our precision value implied 85% of true positives were correctly separated from false positives, and our recall value indicated that 88% of all true address matches were successfully retrieved, with the remaining 12% incorrectly classified as non-matches. With our model now fitted and tested, we could extend its use to predict the match status of unseen address pairs. As an example application, if we had a small sample of matched addresses that belonged to a larger set of unmatched addresses, we could use our trained predictive model to match the remaining addresses in the dataset. This would work so long as the textual representations of addresses used in the prediction stage follow a similar structure to those addresses used to train the classification model.\n", "\n", "Before we conclude, a benefit of using ensemble methods such as random forest classifiers is that we can return an indication of how useful and valuable each feature was in the construction of each decision tree. 
In a practical application, extracting a measure of feature importance might be a useful step in pruning redundant features from the comparison vectors. This might be a useful step in lowering computation times as we decrease the number of address field comparisons required to evaluate candidate address pairs. \n", "\n", "Thus, in the following code block we rank feature importance of particular address fields to the match classification. " ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAlEAAAF4CAYAAAB0N6y9AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAIABJREFUeJzt3Xl4Tnf+//FXVltQu1JbiyASIrHH\nFolYElp0RG0d1dY0qK21TqPVovMd+1YzNbrwNaFShrbodFFqD4NWdSxfYiciiQhZP78//HKPm2zO\nyEKfj+tyXTn32d7nc8753K/7nHPfHIwxRgAAAHggjoVdAAAAwKOIEAUAAGABIQoAAMACQhQAAIAF\nhCgAAAALCFEAAAAWEKLwm3f+/Hl5enpq9+7dhV3Kb05ycrJefvllNW3aVH/5y18eyjLPnDkjd3d3\n7dmzJ9tp+vfvr4kTJz6U9RVV9x7XgwYN0vjx4wu5qsffwoULFRgYWNhloIAQopAngwYNUsOGDeXp\n6Xnfv/fff/+hrWfDhg06e/bsQ1teXlSvXl1HjhxRq1atCnS9ebF06VJlZGQUdhn5ZseOHfrhhx8U\nERGhl19+ubDLKbJu3rypFStW2L22cuVKxcXFZTtPfhzX+/fv165dux7a8u516dIlffbZZ/m2fKsK\no1/Co4EQhTzr0aOHjhw5ct+/CRMmPJTlG2M0c+ZMOqv/79dff9W8efMe6xCVkJAgSXrmmWfk4OBQ\nyNUUXXv27LELUQkJCZoxY4bi4+MLtI6PP/44X6/Yfv3111q3bl2+Ld8K+iXkhBCFhyY5OVlz586V\nv7+/mjRpom7dumn9+vV20yxfvlxBQUHy9vZWhw4dNHfuXBljlJSUJE9PT12/fl2vvPKKhg8fLkly\nd3fX2rVrbfOnpaXJ3d1dkZGRkqSJEydq5MiReuONN+Tt7W3r6NavX6+QkBA1bdpU7du31//8z/8o\nLS0ty7rPnTsnd3d37dy5U9Kdq27vv/++3nvvPfn6+qpNmzZau3at9u/fr549e6pp06YaOHCgLl++\nLOnOG5y7u7u2bdumnj17ytPTU4GBgXaf2G/duqUZM2YoICBA3t7e6tWrl/7xj3/Yxi9cuFC9e/fW\n7Nmz1axZM23btk29e/eWJHl7e2v58uWSpG3btqlv377y8fFRq1atNGbMGMXGxtqW4+7uro0bN2rU\nqFHy8fGRn5+fPvjgA7vtXbFihQIDA+Xt7a0+ffrYtluSLl++rHHjxqlt27Zq2rSpBgwYoEOHDtnG\n//zzzxo0aJCaN28ub29vhYaGav/+/dkeE+fPn9eIESPk5+enli1batCgQTpy5Igk6S9/+YumTp1q\n28YlS5ZkuYzMfent7S0/Pz+99dZbun37tm38vn371KtXL
zVt2lTPPfecfvrpJ7v5ExISNHr0aDVv\n3lx+fn733TaMjIxUy5YttWrVKvn6+urzzz+XJP3rX//SoEGD5Ovrq+bNm2vMmDG6evWqbb4vvvjC\nVleLFi00YsQI2zFx/fp1jRs3Tm3atFHTpk3VtWtXrVmzJtt2unr1qsaMGaO2bdvK29tbvXv3tu2X\n1atX25bt6empRYsWqU2bNkpPT1ePHj00ffp02zEcERGhjh07avLkyfcd19KdQPD++++rVatWatWq\nlaZMmWJry8jISLm7u9udJ2vXrpW7u7skKTQ0VFu3btVf//pX+fr6SpLS09O1YsUKBQUFqUmTJurc\nubM+/PBD2/zJycmaNm2a/Pz81KRJE/n7++uDDz5QVv9RxuzZszVjxgwdPHhQnp6eOnz4sCRp69at\n6t27t5o1ayZ/f39NmzZNiYmJ2balu7u71q9fr5deesnW9ocPH9bq1avVsWNH+fj4aOLEiUpPT7fN\n86D9UkxMjMaNG6fmzZurVatWGjdunN15KEn//Oc/FRQUpMaNG6tPnz46depUtjXjEWaAPBg4cKAZ\nN25cjtO88cYbpm/fviY6OtqkpqaarVu3mkaNGpm9e/caY4zZvHmz8fDwMEeOHDHGGHPkyBHj5eVl\nPvvsM2OMMWfPnjX169c3P/74o22Z9evXN2vWrLENp6ammvr165t169YZY4yZMGGCadmypVmxYoVJ\nTU01GRkZZu3ataZFixZm3759Jj093fzyyy+mY8eOZuHChVnWfe96Bw4caFq3bm2++OILk5qaahYv\nXmyaNm1qRowYYa5du2bi4+NNSEiImTlzpjHGmN27d5v69eubgQMHmujoaHPz5k3z3nvvmSZNmpgb\nN27Y6gwODjYnTpwwqamp5uuvvzYNGzY033//vTHGmAULFpgWLVqYWbNmmeTkZJORkWHWrVtn6tev\nb1JTU40xxly+fNl4eHiYlStXmvT0dHPlyhUTHBxs3nzzTbv26tKli9m3b59JS0szERERpn79+ubX\nX381xhjz97//3bRq1cocPnzYpKammlWrVpnGjRub06dPm+TkZBMYGGimTJliEhISTFJSkpk9e7bx\n9fW1bUdQUJCZM2eOSU5ONrdv3zYffPCB6dChg0lLS7uvXVNTU01gYKAZPXq0iY2NtS2vadOm5tq1\na8YYc9823uvw4cOmfv365ttvvzXGGHPmzBnTtm1bM3/+fGOMMYmJicbHx8dMnz7d3Lp1y0RHR5t+\n/fqZ+vXrm927dxtjjJk0aZLp0qWLiY6ONklJSeZPf/qTadq0qZkwYYKthiZNmphJkyaZxMREk5GR\nYY4fP268vLzM6tWrTUpKirly5YoZOnSoGTRokDHGmEuXLtn2X0ZGhomNjTVhYWFm7Nixxhhj/vjH\nP5rf//73Jj4+3qSnp5sdO3aYJk2amOPHj2e5na+++qoZNGiQuXHjhklJSTFz5swx3t7etnZfsGCB\nadeunW36zGPu9OnTdsdw//79zcWLF01GRkaWx3WzZs3MRx99ZG7fvm2OHTtmWrVqZWbNmpXtvliz\nZo2pX7++bbhTp05mzpw5tuF58+aZzp07m2PHjpm0tDSzb98+06xZM/P5558bY4xZtmyZCQ4ONleu\nXDEZGRnm8OHDpnXr1mbbtm1ZtsOECRNMaGiobXjPnj2mQYMGZtOmTSY5OdlER0ebXr16mddffz3L\n+Y25cw4EBwebX375xSQnJ5thw4aZDh06mFmzZplbt26Z48ePm8aNG5tvvvnGGGOtX+rXr58JCwsz\n169fN3FxcebFF180Q4YMse0rHx8f884775gbN26Yq1evmh49epgRI0ZkWzMeXVyJwkMRFxenjRs3\n6vXXX1eNGjXk7OyswMBA+fv72z6BBwQEaPv27WrcuLEkqXHjxqpXr57dlQ4rHBwcNHjwYDk7O8vB\nwUErV65Uv3795OvrK
0dHRzVo0EBDhw61u6KVm5o1a6p79+5ydnZWly5dlJSUpAEDBqh8+fIqU6aM\n/Pz8dOLECbt5Bg4cqBo1aqhkyZIKCwtTcnKyfvjhByUmJmrDhg0KCwvTM888I2dnZwUEBKh9+/aK\niIiwzZ+QkKDXXntNrq6uWd7aqly5srZv367Q0FA5OjqqUqVKateu3X3t17lzZ/n6+srJyUkhISGS\n7twalO5c1ejVq5c8PT3l7OysF154QTNnzpSrq6t++OEHnT9/XpMnT1bp0qVVokQJjRkzRk5OTvrq\nq69sNbq4uMjFxUXFihXTq6++qu+//15OTk731bt9+3ZFR0dr6tSpKleunEqUKKGRI0fK1dXV7ipc\nTho3bqxdu3apU6dOtv3i4+Nj2+bM9h05cqSKFy+uGjVq6MUXX7RbxldffaUXXnhBNWrUUIkSJfT6\n66/LxcXFbppbt27pxRdfVKlSpeTg4KA1a9aoYcOGCg0NlYuLiypVqqQ333xTe/bsUXR0tBITE5We\nnq4SJUrIwcFB5cqV08KFCzV79mxbOzk6OqpYsWJydHRU27ZtdfDgQdWtWzfL7Zw3b56WLl0qNzc3\nubi4KCQkRDdv3rzvGMtN9+7dVbVq1WxvjVapUkVDhgxRsWLF5O7urpCQEP3zn/98oHVkysjI0P/+\n7//q5Zdflru7u5ycnOTr66vnn3/eds7Hx8fL0dFRxYsXl4ODgzw9PfXjjz+qffv2eVrHypUr1b59\ne/Xo0UOurq6qUaOG/vCHP2jz5s053srs1KmTGjRoIFdXV3Xs2FExMTEaPXq0ihcvrrp168rd3d3W\ntg/aLx07dkwHDx7UyJEj9cQTT6hs2bJ6++231b9/f9sVtqSkJI0dO1Zubm6qWLGi/Pz8dPz48Ty3\nLR4dzoVdAB4dX3zxhbZs2XLf62+//baeeeYZZWRkaPjw4XYduDFGTZo0kSSlpKRo4cKF+uabb2yX\nvlNTU7N9Y8mr6tWry9HxP58HTp06pePHj9s9Q5LZuaWkpMjV1TVPy8xUvHjx+14rUaKEkpOT7eZ5\n5plnbH+XLVtWZcqU0cWLF3X27FllZGSoXr16902/fft22/ATTzyh0qVL51jXhg0btGbNGl24cEHp\n6elKT09X1apV7aapVauWXZ2SbLdszpw5o759+9pNHxwcLEnauHGj0tLS1LJlS7vxGRkZOn/+vCRp\nwoQJeuedd7Ru3Tq1bt1a/v7+8vf3t2v/TGfOnFH58uVVoUIF22suLi6qWbOmzp07l+N23r3uTz75\nRJs2bdKVK1dkjFFaWprtdtLFixdVpkwZlS1b1jbP3e18/fp1JSUl6amnnrK95urqatdGmWrUqGH7\n+9SpUzp06JA8PT3tpnFyctK5c+fUpk0bDR48WC+++KLq16+v1q1bq2vXrrZjffjw4XrttddstzH9\n/PwUHBwsNze3LLfz3//+t+bNm6eff/5ZN2/etL1+7zGWm7u3ISv3HoNPPfWULl68+EDryBQbG6u4\nuDhNnz5d7777ru11Y4wqVaok6c4Hix07dqhdu3Zq3ry52rZtq5CQELtjIidnzpyRn5+f3Wt169aV\nMUbnz5+32+93u/dcrVixoooVK2b3WmbbPmi/dPr0aUmyO6Zq1qypmjVr2obLly+vUqVK2YaLFSum\nlJSUvGwyHjGEKORZjx499Oc//znLcceOHZMkrVmzRo0aNcpymnfeeUc7duzQ4sWL5eHhIScnJ/Xr\n1++BasjqIet7ryoUL15cr7322n1XJB5EVp/kc3vw+e5nLKQ7byZ3hwtzz3MgGRkZSk1NtQ3fux33\n+vzzz/WnP/1J77//vrp06aJixYpp9uzZ+uKLL+ymyyrQ3D3u3joyFS9eXG5uboqKisp2/l69eikg\nIEC7du3Sjh07NGXKFNWrV08ff/xxllejslrXvdudk6VLl+qTTz7R/Pnz1apVK7m4uGj
s2LG2Z5NS\nUlLu2y93HyOZb1z3tklux1Hx4sXVsWNHLV26NNvapkyZomHDhtm+YThgwAC99NJLGjNmjBo0aKCv\nv/5aBw4c0I4dO7RixQotWrRIERERdm/wknTjxg299NJLat++vTZt2qRKlSrp1KlT6tatW57aKLtt\nyEpWx+jd4SK36e+W+eFi7ty52X6l/8knn9SGDRt0+PBh7dy5Uxs2bNDChQv10Ucf3RdQs5PVeSMp\nx2Po3v2d0znxoP1S5nGe3XmU2/rweGFP46GoUaOGnJycdPToUbvXL1y4YHtQ9eDBgwoKCpKXl5ec\nnJzydLuiWLFidg8RnzlzJtdaateufV8d165ds/uEnx/uri0uLk4JCQl68sknVaNGDTk6OtpuqWU6\nfvy4ateuneflHzx4UM8884xCQkJsb3wPeiu0du3aOnnypN1rq1at0rFjx1S7dm0lJiYqOjrabvzd\n30qKjY1VqVKlFBAQoGnTpmnt2rXat2+fLUTfu67r16/rypUrttdSUlIUHR2tOnXq5KnegwcPqkWL\nFmrXrp1cXFyUkZFh9+B41apVdePGDbsHje9u5woVKsjFxUUXLlywvZacnJzrcVS7dm39+uuvdmEr\nOTnZ9uB4RkaG4uLiVKVKFfXp00fz589XeHi4Pv30U0l3budlZGTYHkjfuHGjihcvrq1bt963rpMn\nTyohIUFDhw61XcHJfKj6Ybt3u8+dO6cnn3xS0n9CUV7Pt8xbVfeea5cvX7aF16SkJN2+fVteXl4a\nPny4IiMj1bBhQ23YsCFP9daqVSvL88bR0dHuys9/40H7pcxz9u4HxaOjo/W3v/0t2y+v4PFFiMJD\nUapUKfXt21dLlizR0aNHlZ6ern379um5557Tl19+KenOJe+jR48qKSlJ58+f19SpU1WtWjVdvHhR\nxhiVLFlS0p3O6caNG5Kkp59+Wv/85z+VlJSk2NhYLVmyJNdP20OGDNGXX36pr776SqmpqTp79qxe\neeUVzZw5M1/b4NNPP9W5c+d069YtLV68WCVLllS7du3k5uam5557TosWLdLp06eVmpqqL7/8Uj/+\n+KNCQ0OzXV7mrbgTJ04oMTFRNWvW1KVLl3T+/HnFx8dr0aJFSkpKUlxcnJKSkvJU4wsvvKBNmzZp\n3759SktL0/r16zVr1iwVL15cbdu2Vb169TRt2jTbG+Hq1avVvXt3nT17VhcuXFD79u21ceNGpaSk\nKC0tTVFRUSpWrJiqVat237r8/PxUu3Ztvfvuu0pISNDNmzf15z//WRkZGerevXue6q1Zs6ZOnTql\n69evKyYmRm+//bbc3Nx05coVpaWlyc/PT87Ozlq8eLFu376t06dPa+XKlbb5nZ2d1aFDB61atUoX\nLlzQzZs3NXfu3Fx/NqJ///66evWq5s2bp8TERMXHx+vtt9/W4MGDlZGRoU2bNik4OFiHDx+WMUY3\nb97UTz/9pKefflrGGD3//PP685//bDuOjx8/rvj4eD399NP3ratatWpycnLSgQMHlJqaqp07d9pu\nm2feaitRooQSEhJ0+fJlJSUl2Y6NkydP5vhNtXudP39ef//735WSkqKjR49q48aNtitembVt2rRJ\n6enpOnDggL799lu7+UuUKKHo6GjduHFD6enpGjJkiFatWqVdu3YpPT1dx44d0wsvvGD7NmlYWJgm\nT56sa9euSboTyi5evJhtiC5RooQuX76suLg43b59W4MHD9aPP/5ou9X8f//3f1qyZIm6dOmicuXK\n5Xm7c/Kg/VK9evXUvHlzzZs3TzExMbpx44Zmzpypbdu2ydmZmzu/NYQoPDSTJk1Sx44dNWzYMDVr\n1kxvvfWWRo0apZ49e0qS3nzzTSUnJ6t169Z65ZVX9Nxzz2nEiBE6cuSIXn75ZZUvX14hISGaNWuW\nhg0bJkmaOnWqYmJi1KpVKw0aNEi/+93vsn2uJFO
PHj305ptvau7cuWrWrJkGDhwob29v29fp88vv\nfvc7hYWFqUWLFtq2bZuWLVtmey5i8uTJ8vX11e9//3u1bNlSH374oRYuXKgOHTpku7w2bdqoYcOG\ntnDav39/tWjRQsHBwQoODlbx4sU1e/ZslSlTRp06dbK7gpCdPn36aNSoURo/frx8fX31ySefaMmS\nJapdu7acnJy0dOlSFStWTN26dVPr1q21YcMG/eUvf1GNGjVUrVo1zZ07V8uXL1eLFi3UqlUrRURE\naOnSpVm+oTk7O2vJkiW6ffu2goKC5O/vr5MnT2r16tWqXLlyntp0+PDhqlq1qjp16qR+/frZ9mNc\nXJxCQkJUsWJFLV26VDt27FDLli31+uuv69VXX7VbxvTp01WnTh317NlTQUFBKlu2rO2ZquxUr15d\ny5Yt065du9SmTRsFBQUpPj5eH374oRwdHRUSEqIBAwZo9OjRtq/2x8TEaM6cOXJwcNDixYv173//\nWx07dpS3t7fGjx+vkSNHZrm/K1eurClTpmjZsmVq0aKFVq5cqffee0/dunXTW2+9pQ0bNqhLly6q\nVKmSOnfubLua06pVK40aNeqBjuvu3bvr5MmTateunYYOHaouXbrYfuS0QYMGGj58uObPny9fX1+t\nWLFCr732mt38L7zwgr7//nt17txZ169f10svvaQBAwZo0qRJatq0qcLCwvTcc8/Z9sGsWbOUkpKi\nbt26qUmTJho2bJh69uyp/v37Z1lfz549lZKSog4dOmjHjh3y9fXVzJkz9eGHH6p58+YaNmyY2rVr\np1mzZuV5m3NjpV9atGiRSpcurS5duiggIMB2LuK3x8HkdGMXQK727NmjwYMHa+vWrVk+sAwAeDxx\nJQoAAMACQhQAAIAF3M4DAACwgCtRAAAAFhCiAAAALCjwH7XI6deQAQAAihofH58sXy+UXwbLrhjk\nj6ioKNq8gNHmBY82L3i0ecGjzQteThd/uJ0HAABgASEKAADAAkIUAACABYQoAAAACwhRAAAAFhCi\nAAAALCBEAQAAWECIAgAAsIAQBQAAYAEhCgAAwII8/bcvM2bM0KFDh+Tg4KDJkyfLy8vLNm7VqlX6\nxz/+IUdHRzVu3FhTpkzJt2IBAACKilxD1N69e3XmzBlFREToxIkTmjRpktauXStJSkxM1PLly7V1\n61Y5Oztr6NCh+te//qWmTZvme+EAAACFKdcQtWvXLgUEBEiS6tatq4SEBCUmJsrNzU0uLi5ycXFR\nUlKSSpYsqVu3bqls2bL5XjTyrnbt2kpJSdGFCxcKuxQAAB4ruYaomJgYeXh42IYrVKigq1evys3N\nTcWKFVNYWJgCAgJUvHhx9ejRQ3Xq1Ml1pTn9j8h4uFJSUiTR5oWBNi94tHnBo80LHm1edOQaoowx\n9w07ODhIunM7b9myZdq8ebPc3Nw0ZMgQHTt2TA0aNMhxmT4+Pv9FyXgQrq6uSklJoc0LWFRUFG1e\nwGjzgkebFzzavODlFFpz/XZelSpVFBMTYxu+cuWKKlasKEk6efKkatSoofLly8vV1VW+vr766aef\nHkLJAAAARVuuIapt27basmWLJOno0aOqXLmy3NzcJEnVq1fXyZMndfv2bRlj9NNPP6l27dr5WjAA\nAEBRkOvtvGbNmsnDw0OhoaFycHBQeHi4IiMjVbp0aQUGBuqll17S4MGD5eTkJG9vb/n6+hZE3QAA\nAIUqT78TNX78eLvhu595Cg0NVWho6MOtCgAAoIjjF8sBAAAsIEQBAABYQIgCAACwgBAFAABgASEK\nAADAAkIUAACABYQoAAAACwhRAAAAFhCiAAAALCBEAQAAWECIAgAAsIAQBQAAYAEhCgAAwAJCFAAA\ngAWEKAAAAAsIUQAAABYQogAAACwgRAEAAFhAiAIAALCAEAUAAGABIQoAAMACQhQAAIAFhCgAAAAL\nCFEAAAAWEKI
AAAAsIEQBAABYQIgCAACwgBAFAABgASEKAADAAkIUAACABYQoAAAACwhRAAAAFhCi\nAAAALCBEAQAAWECIAgAAsIAQBQAAYAEhCgAAwAJCFAAAgAWEKAAAAAsIUQAAABYQogAAACwgRAEA\nAFhAiAIAALCAEAUAAGABIQoAAMACQhQAAIAFhCgAAAALCFEAAAAWEKIAAAAsIEQBAABYQIgCAACw\ngBAFAABgASEKAADAAkIUAACABYQoAAAACwhRAAAAFhCiAAAALCBEAQAAWECIAgAAsIAQBQAAYAEh\nCgAAwALnvEw0Y8YMHTp0SA4ODpo8ebK8vLxs4y5evKixY8cqNTVVjRo10jvvvJNvxQIAABQVuV6J\n2rt3r86cOaOIiAi9++67mj59ut34WbNmaejQofrss8/k5OSkCxcu5FuxAAAARUWuIWrXrl0KCAiQ\nJNWtW1cJCQlKTEyUJGVkZCgqKkr+/v6SpPDwcFWrVi0fywUAACgacg1RMTExKleunG24QoUKunr1\nqiQpNjZWbm5uWrBggQYOHKjZs2fLGJN/1QKPgNq1ayskJKSwywAA5LNcn4m6NxQZY+Tg4GD7+/Ll\ny+rTp49GjRqlV155Rdu2bVPHjh1zXGZUVJT1ivFAUlJSJNHmBYk2Lzy0ecGjzQsebV505BqiqlSp\nopiYGNvwlStXVLFiRUlSuXLl9OSTT6pmzZqSpNatW+v48eO5higfH5//omQ8CFdXV6WkpNDmBYg2\nLxxRUVG0eQGjzQsebV7wcgqtud7Oa9u2rbZs2SJJOnr0qCpXriw3NzdJkrOzs2rUqKHTp09Lkn7+\n+WfVqVPnIZQMAABQtOV6JapZs2by8PBQaGioHBwcFB4ersjISJUuXVqBgYGaPHmywsPDlZycrHr1\n6tkeMgcAAHic5el3osaPH2833KBBA9vftWrV0kcfffRQiwIAACjq+MVyAAAACwhRAAAAFhCiAAAA\nLCBEAQAAWECIAgAAsIAQBQAAYAEhCgAAwAJCFAAAgAWEKAAAAAsIUQAAABYQogAAACwgRAEAAFhA\niAIAALCAEAUAAGABIQoAAMACQhQAAIAFhCgAAAALCFEAAAAWEKIAAAAsIEQBAABYQIgCAACwgBAF\nAABgASEKAADAAkIUAACABYQoAAAACwhRAAAAFhCiAAAALCBEAQAAWOBc2AU8btptGlvYJdi5dOu6\npKJX1/bgOYVdAgAA/xWuRAEAAFhAiAIAALCAEAUAAGABIQoAAMACQhQAAIAFhCgAAAALCFEAAAAW\nEKIAAAAsIEQBAABYQIgCAACwgBAFAABgASEKAADAAkIUAACABYQoAAAACwhRAAAAFhCiAAAALCBE\nAQAAWECIAgAAsIAQBQAAYAEhCgAAwAJCFAAAgAWEKAAAAAsIUQAAABYQogAAACwgRAEAAFhAiAIA\nALCAEAUAAGABIQoAAMACQhQAAIAFhCgAAAALCFEAAAAWEKIAAAAsIEQBAABYkKcQNWPGDPXr10+h\noaE6fPhwltPMnj1bgwYNeqjFAQAAFFXOuU2wd+9enTlzRhERETpx4oQmTZqktWvX2k1z4sQJ7du3\nTy4uLvlWKAAAQFGS65WoXbt2KSAgQJJUt25dJSQkKDEx0W6aWbNmacyYMflTIQAAQBGU65WomJgY\neXh42IYrVKigq1evys3NTZIUGRmpFi1aqHr16nleaVRUlIVS8Th5nI+BlJQUSY/3NhZVtHnBo80L\nHm1edOQaoowx9w07ODhIkuLi4hQZGakVK1bo8uXLeV6pj4/PA5b5CNm0qrAreCQ8zseAq6urUlJS\nHuttLIqioqJo8wJGmxc82rzg5RRac72dV6VKFcXExNiGr1y5oooVK0qSdu/erdjYWA0YMEAjRozQ\nzz//rBkzZjyEkgEAAIq2XEP6d8EtAAAWQ0lEQVRU27ZttWXLFknS0aNHVblyZ
dutvK5du+rLL7/U\nmjVrtGjRInl4eGjy5Mn5WzEAAEARkOvtvGbNmsnDw0OhoaFycHBQeHi4IiMjVbp0aQUGBhZEjQAA\nAEVOriFKksaPH2833KBBg/umeeqpp/Tpp58+nKoAAACKOH6xHAAAwAJCFAAAgAWEKAAAAAsIUQAA\nABYQogAAACwgRAEAAFhAiAIAALCAEAUAAGBBnn5sEwCKstq1ayslJUUXLlwo7FIA/IZwJQoAAMAC\nQhQAAIAFhCgAAAALCFEAAAAWEKIAAAAsIEQBAABYQIgCAACwgBAFAABgASEKAADAAkIUAACABYQo\nAAAACwhRAAAAFhCiAAAALCBEAQAAWECIAgAAsIAQBQAAYIFzYRcA/LfabRpb2CXYuXTruqSiV9f2\n4DmFXQIAPFa4EgUAAGABIQoAAMACQhQAAIAFhCgAAAALCFEAAAAWEKIAAAAsIEQBAABYQIgCAACw\ngBAFAABgASEKAADAAkIUAACABYQoAAAACwhRAAAAFhCiAAAALCBEAQAAWECIAgAAsIAQBQAAYAEh\nCgAAwAJCFAAAgAWEKAAAAAsIUQAAABYQogAAACwgRAEAAFhAiAIAALCAEAUAAGABIQoAAMACQhQA\nAIAFhCgAAAALCFEAAAAWEKIAAAAsIEQBAABYQIgCAACwgBAFAABgASEKAADAAkIUAACABc55mWjG\njBk6dOiQHBwcNHnyZHl5ednG7d69W3PmzJGjo6Pq1Kmj9957T46OZDMAAPB4yzVE7d27V2fOnFFE\nRIROnDihSZMmae3atbbxb731lj755BNVrVpVo0aN0vbt29WhQ4d8LRp513z5S4VdAgAAj6VcLxnt\n2rVLAQEBkqS6desqISFBiYmJtvGRkZGqWrWqJKl8+fK6fv16PpUKAABQdOQaomJiYlSuXDnbcIUK\nFXT16lXbsJubmyTpypUr2rlzJ1ehAADAb0Kut/OMMfcNOzg42L127do1DR8+XG+99ZZd4MpOVFTU\nA5aJxw3HQMF7nNs8JSVF0uO9jUUVbV7waPOiI9cQVaVKFcXExNiGr1y5oooVK9qGExMT9fLLL+v1\n11+Xn59fnlbq4+NjodRHxKZVhV3BI+GhHgO0eZ48zuedq6urUlJSHuttLIqioqJo8wJGmxe8nEJr\nrrfz2rZtqy1btkiSjh49qsqVK9tu4UnSrFmzNGTIEG7jAQCA35Rcr0Q1a9ZMHh4eCg0NlYODg8LD\nwxUZGanSpUvLz89P69ev15kzZ/TZZ59JkoKDg9WvX798LxwAAKAw5el3osaPH2833KBBA9vfP/30\n08OtCAAA4BHAr2ICAABYQIgCAACwgBAFAABgASEKAADAAkIUAACABYQoAAAACwhRAAAAFhCiAAAA\nLCBEAQAAWJCnXywHgLu12zS2sEuwc+nWdUlFr67twXMKuwQA+YgrUQAAABYQogAAACwgRAEAAFhA\niAIAALCAEAUAAGABIQoAAMACQhQAAIAFhCgAAAALCFEAAAAWEKIAAAAsIEQBAABYQIgCAACwgBAF\nAABgASEKAADAAkIUAACABYQoAAAACwhRAAAAFhCiAAAALCBEAQAAWECIAgAAsIAQBQAAYAEhCgAA\nwAJCFAAAgAWEKAAAAAsIUQAAABYQogAAACwgRAEAAFhAiAIAALCAEAUAAGABIQoAAMACQhQAAIAF\nhCgAAAALCFEAAAAWEKIAAAAsIEQBAABYQIgCAACwwLmwCwAAPHpq166tlJQUXbhwobBLAQoNV6IA\nAAAsIEQBAABYQIgCAACwgBAFAABgASEKAADAAkIUAACABYQoAAAACwhRAAAAFhCiAAAALCBEAQAA\nWMB/+wI8ZM2Xv1TYJQAACgAhCgAeAe02jS3sEuxcunVdUtGra3vwnMIuAb8h3M4DAACwgBAFAABg\nASEKAADAgjyFqBkzZqhfv34KDQ3V4cOH7
cbt3LlTffv2Vb9+/bR48eJ8KRIAAKCoyTVE7d27V2fO\nnFFERITeffddTZ8+3W78u+++q4ULF2r16tXavn27Tpw4kW/FAgAAFBW5hqhdu3YpICBAklS3bl0l\nJCQoMTFRknT27FmVLVtWTz75pBwdHdWhQwft2rUrfysGAAAoAnL9iYOYmBh5eHjYhitUqKCrV6/K\nzc1NV69eVfny5W3jKlasqLNnz+a60qioKIvlFn3znhxQ2CU8Eh7mMUCb583j3OYhTn+XVPTqos0L\n3uP8/pLpt7CNj4pcQ5Qx5r5hBweHLMdJso3LiY+PT17rw0MQFRVFmxcw2rxgubq6KiUlhTYvQLR5\n4aBvKXg5hdZcb+dVqVJFMTExtuErV66oYsWKWY67fPmyKlWq9N/UCgAA8EjINUS1bdtWW7ZskSQd\nPXpUlStXlpubmyTpqaeeUmJios6dO6e0tDR99913atu2bf5WDAAAUATkejuvWbNm8vDwUGhoqBwc\nHBQeHq7IyEiVLl1agYGBmjZtmsaNGydJ6t69u+rUqZPvRQMAABS2PP3feePHj7cbbtCgge3v5s2b\nKyIi4uFWBQAo0k6fPs0DzvjN4xfLAQAALCBEAQAAWECIAgAAsIAQBQAAYAEhCgAAwAJCFAAAgAWE\nKAAAAAsIUQAAABbk6cc2AaAo44cfARQGrkQBAABYQIgCAACwgBAFAABgASEKAADAAkIUAACABYQo\nAAAACwhRAAAAFhCiAAAALCBEAQAAWECIAgAAsIAQBQAAYAEhCgAAwAJCFAAAgAWEKAAAAAscjDGm\nIFcYFRVVkKsDAAD4r/j4+GT5eoGHKAAAgMcBt/MAAAAsIEQBAABYQIgCAACwgBAFAABgASEKAADA\nAkJUFs6dO6dGjRoVdhkFbuLEiVqyZEm+LX/27NlavXp1vi0/O0V5f3bt2lUxMTGFXcYj68UXX1Rk\nZGSO06xcuVLz5s0roIoeHYcPH9ZLL70kSYqJidE333xjeVmFdW4/Kh5G37pnzx4FBgY+8Hz0MfnL\nubALwG/HuHHjCruEImfz5s2FXcJjb+DAgYVdQpHk5eWl5cuXS7rzBr1z50517tzZ0rI4t4su+pj8\nxZWoHHz22WcKCQlRhw4dtGnTJmVkZGju3Lnq2rWrunbtqokTJyopKUmS5O/vr/3799vmzRxOS0vT\n1KlTFRQUpMDAQI0YMUKJiYmSpG+++UYhISHq3Lmzhg4dqtjY2Fxr8vf319///nf17dtXfn5+mjVr\nlqT7P6XcPbxw4UKFh4fr1VdflZ+fn9544w19++236t27t/z8/PTdd9/Z5rt8+bIGDhyoTp06KSws\nzLZ9J06c0MCBAxUUFKSQkBAdOXLEtp7Q0FCNHj0614707k9jBw8eVO/evdW1a1d1795dO3fulHTn\nqpGfn59mzJhhe/Pbs2ePnnvuOXXt2lXPP/+8bd0PqijuT3d3d126dEmStHjxYgUFBSkgIECvvvqq\nEhISJN3Zf1OnTlXfvn310Ucf5Vh3UXbvvs1uv2ZkZOjtt99WUFCQ/P399cYbbyg1NVWSdPbsWT3/\n/PMKCAjQuHHjlJ6enut6Fy5cqClTpkiSTp06pf79+6tbt24KDAzUpk2bbNO5u7tr2bJlCgoKUnp6\nuo4dO6bQ0FB17dpVvXr10vbt2/OhVQrO559/rqCgIAUFBemNN97Q9u3bFRgYqJ9//lnvvPOOtmzZ\nojFjxqh37952b7zffvutnn322RyXXdjndkHLqh/45ptvsu2Dpez71rv7gLuHs+tb33//fQUFBalr\n1646cOBArrX+lvqYQmFwn7Nnzxp3d3ezevVqY4wxX331lencubPZtGmTefbZZ83NmzdNenq6+cMf\n/mAWL15sjDGmU6dOZt++fbZlZA5/9913ZvDgwSYjI8NkZGSYuXPnmh9++MFcuHDBNG/e3Pz666/G\nGGM++
OADM3LkyFxr69Spkxk7dqxJS0szly5dMh4eHubixYtm9+7dJiAgwDbd3cMLFiww7dq1M9eu\nXTOxsbGmcePGJjw83BhjzKeffmr69+9vjDFmwoQJplOnTubatWsmLS3NDBgwwHz00UcmPT3dBAcH\nmzVr1hhjjNm/f7/x8/MzqampZvfu3cbT09Ps3Lkz19onTJhga6/g4GCzadMmY4wxn3/+ua3Ws2fP\nGg8PDxMZGWmMMebmzZumZcuWZv/+/cYYYzZv3my6dOli0tPTc11fpqK8P+vXr28uXrxojhw5Ylq3\nbm1u3Lhh0tPTzYsvvmirZcGCBcbPz89cu3bNGGNyrLsou3vf5rRfN2/ebIKDg01KSoq5ffu26dat\nm1m/fr0xxphRo0aZOXPmGGOMOXTokGnUqJFZt25djutdsGCBmTx5sjHGmFdffdUsW7bMGGPM3r17\njZeXl0lJSTHG3NkXS5cuNcYYk56ebrp162Y2btxojDHm8OHDpnnz5ubGjRsPuVUKxtmzZ02rVq3M\npUuXTEZGhgkLCzN//etf7fqIzDb629/+ZsLCwmzzTpo0ydZm2Smsc7uwZNcPZNcHZ9e3GvOfPiBT\n5vC9fevu3btNw4YNbW0bERFhevXqlWutv6U+pjBwJSobxhj16tVLktSoUSNdunRJ33//vZ599lmV\nLFlSjo6O6t27t3788cccl1O+fHmdPHlSX3/9tW7duqXRo0erXbt2+vbbb+Xp6an69etLkvr3769v\nv/02T5+sQ0JC5OTkpCpVqqhChQq6ePFirvM0a9ZM5cuXV7ly5VSpUiV16NBBklS/fn1duXLFNl37\n9u1Vvnx5OTk5KTAwUP/617906tQpRUdHq0+fPpLu/Px9+fLldfDgQUlS8eLF1bp161xruNv69evV\nrVs32/LOnj1rG5eammr7BHfo0CFVrVrV9pP7QUFBun79us6fP/9A6yvK+1OSGjdurO+//15ubm5y\ndHSUt7e3XZs0adJE5cuXlyRLdRcVmfs2p/0aFBSkdevWycXFRcWKFZOnp6etLfbv3287bry8vPT0\n008/0PqXLFliew7Ix8dHycnJunr1qm18x44dJd25ahITE6MePXpIkjw9PVWtWrVH5krJvX788Ud5\ne3urSpUqcnBw0OzZs7N9TrB79+7avn27bty4oYyMDH333Xe2Ns+Lgj63C0NW/YCrq2uO82TVt+bm\n3r61WLFitrbt1q2bfvnlFyUnJ+ep5t9KH1PQeCYqG05OTipRooQkydHRURkZGYqNjVXZsmVt05Qt\nW1bXrl3LcTleXl6aOnWqPv30U02YMEH+/v4KDw/XjRs3dOjQIXXt2tU2rZubm+Li4lShQoUcl+nm\n5mZXZ17eqEuVKmU3T8mSJe22LVPmSSRJpUuXVkJCghISEpSenq7u3bvbxiUmJiouLk5lypSxa5O8\n2rhxoz755BPdvHlTGRkZMnf970NOTk62bYyNjVWZMmXs5i1durSuXbumGjVq5Hl9RXl/StKtW7c0\nc+ZM7dmzR5IUHx9ve0PPrC2TlbqLisx9m9N+LVWqlKZPn66jR4/KwcFBMTExGjJkiKQ77XL38X/v\nMnKzfft2LV26VNevX5eDg4OMMXbH/xNPPCHpThuXLl1aDg4OduvKyy3aouj69et2bVWsWDE5OTll\nOW2VKlXk5eWlrVu3qmbNmqpevfoDnWsFfW4Xhqz6gdyCZlZ9a27u7VufeOIJOTreufaR2Y7x8fGq\nXLlyrsv6rfQxBY0Q9QAqVqyouLg423BcXJwqVqwo6f4wEh8fb/s7875yXFycJk+erOXLl6tWrVpq\n06aNFixY8FBquzdM3b3+B3H3fAkJCSpbtqwqV66sUqVKZfmAYuYJ+SAuX76sqVOnau3atWrYsKFO\nnz6toKCgLKetUKGCXZsbYxQfH5+nYJKborQ/P/74Y50+fVqRkZEqVaq
U5s6dq8uXLz9w3Y+KnPbr\n3Llz5ezsrI0bN8rV1dXueZAyZcrYnkGT9EChJjU1VaNHj9a8efPUoUMHpaSkyMvLK9v64uPjZYyx\nBam8BuKiqFy5crYrx9KdD0E5vSn26NFDmzdvVq1atew+POWmqJzbBeHefuDTTz/NsQ/Oqm+V7vQ1\nmfPl1m/fuwzpP8E/N7+1PqagcDvvAXTo0EH/+Mc/dOvWLaWlpWnt2rW222KVKlXSsWPHJElffvml\n7RLrunXrtHjxYkl3DvbM2w9t27bV/v37bZdTDx8+rHfffddybZUqVdLVq1d17do1paen2z0w+yB+\n+OEHxcfHKz09XV9//bV8fHxUvXp1Va1a1RaiYmNjNXbsWMsPGsbGxqpkyZKqU6eO0tLSFBERIUl2\nb46ZvLy8dPXqVdsbwBdffKGqVavqqaeesrTuuxWl/Xnt2jXVqVNHpUqV0vnz5/X999/r5s2bD1z3\noyKn/Xrt2jXVq1dPrq6uOnbsmA4ePGhri6ZNm+rrr7+WJB04cEDR0dF5XuetW7eUlJRku4318ccf\ny8XFJct2fuqpp1S1alV9+eWXtnXFxMRkG7qKug4dOujAgQM6d+6cjDEKDw+3aztnZ2fduHHDNty1\na1dFRUVp8+bNdldXc1NUzu38llU/kFsfnFXfKtn3NevWrbNdacrK7du3bcf/5s2b5enpmettxEy/\ntT6moHAl6gF069ZN//73v9W7d28ZY9SyZUsNHjxYkvTaa68pPDxca9asUVBQkOrWrStJ6ty5syZP\nnqwuXbrIyclJtWrV0qxZs/TEE09o+vTpCgsLU2pqqkqVKqXJkydbrq1WrVrq06ePnn32WVWrVk29\nevXSL7/88sDL6dSpk0aOHKlz586pcePG6tOnjxwcHDRnzhxNmzZN8+bNk6Ojo37/+9/bbgk+qAYN\nGqh9+/by9/fXk08+qYkTJ+rAgQN64YUX7vstlZIlS2r+/PmaPn26kpKSVL58ec2ZM8fuNotVRWl/\nhoaGauTIkfL391fjxo01adIkhYWFacWKFQ9U96Mip/06dOhQvfnmm/rss8/UsmVLTZgwQRMnTlST\nJk30xhtvaNy4cdqwYYOaNGmiNm3a5HmdZcqU0bBhwxQSEqKqVavqD3/4gwICAjRs2DBt2bLFbtrM\nYz48PFyLFi1SiRIlNH/+fMvHfGGrWrWq3nnnHQ0ZMkROTk7y9PRUo0aNtG7dOkl3PgSsWLFCffr0\n0bp16/TEE0+oefPmio+PV7Vq1fK8nqJybue37PoBNze3bPvgrPpWSRozZoymTZumBQsWKDQ01O52\n9b2efvppHTx4ULNnz5ajo6Pt29l58VvrYwqKg7n7hjWQjyZOnKiaNWvqtddeK+xSigx3d3dt27ZN\nVatWLexSHlsLFy7UpUuX9N577xV2KY+UadOmqV69ehowYECu03JuF130MfmL23koMAkJCbaHu/Gf\nZxqKFy9eyJU83jjuHtzp06f1ww8/qGfPnnmanjYumuhj8h+384qYZcuW6fPPP89y3PDhw3P90bvC\ndPLkSYWFhWU5rlKlSjp37pwmTpxYwFUVruz2Z3p6uqKjo9WtW7c8PxiKrCUmJqpv375ZjitVqpQu\nXbqkRYsWFXBVj6758+drw4YN+uMf/6jSpUtL4twuyuhjChe38wAAACzgdh4AAIAFhCgAAAALCFEA\nAAAWEKIAAAAsIEQBAABYQIgCAACw4P8BbAf4jUIcCKcAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# extract feature importances from random forest classifier\n", "feature_importance_to_match = rf.feature_importances_\n", "\n", "# calculate standard deviation of feature importances across trees\n", "std = np.std([tree.feature_importances_ for tree in rf.estimators_], axis=0)\n", "indices = np.argsort(feature_importance_to_match)[::-1]\n", "\n", "# plot importances alongside feature labels\n", "plt.figure(figsize=(10,6))\n", "plt.title(\"Feature importances of address attributes to match\", size=15)\n", "plt.bar(range(X_train.shape[1]), feature_importance_to_match[indices],\n", " color=\"#3CB371\", yerr=std[indices], align=\"center\")\n", "feature_labs = X_train.columns[np.argsort(feature_importance_to_match)[::-1]].values\n", "plt.xticks(range(X_train.shape[1]), feature_labs, size=12)\n", "plt.xlim([-1, X_train.shape[1]])\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In our case, and as one might expect, the restaurant's house number, `house_number_jaro` is the most important feature used for resolving candidate pairs of addresses into a match while the suburb, `suburb_jaro`, is the least important feature and so could possibly be removed as an address field from the comparison step." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Conclusion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Address matching is a data enrichment process that is increasingly required in wide-ranging, real-world applications. For example, matching between census, commercial or lifestyle records has the potential benefit of improving data quality, enabling spatial data visualisation and joining data that would otherwise remain isolated in data silos. In absence of unique identifiers for directly linking data, practitioners have typically relied on statistical linkage methods for matching addresses. 
Linking address datasets in this way can unlock attributes that would otherwise be inaccessible because no primary keys exist to join the two datasets. Thus, in this notebook, we documented the steps required to execute the work flow for an address matching exercise that utilised recent innovations in machine learning. While the dataset we used was small, the intention of the notebook was to demonstrate an approach that is reproducible within a self-contained environment, and which the interested user might adapt to larger data challenges. Training a predictive model to link restaurant addresses may seem a trivial problem to solve, but these addresses could easily be replaced by more meaningful address records in areas such as public health and socio-economic mobility studies. Therefore, the core contribution of this notebook is to equip the regional scientist with the skills necessary to extend the address matching work flow to their own (and far more interesting) use cases." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Bibliography\n", "\n", "
    \n", "
  • Baldovin, T., Zangrando, D., Casale, P., Ferrarese, F., Bertoncello, C., Buja, A., Marcolongo, A. and Baldo, V. (2015) Geocoding health data with geographic information systems: A pilot study in northeast Italy for developing a standardized data‐acquiring format. Journal of Preventive Medicine & Hygiene, 56, 88–94.
  • \n", "
  • Cayo, R. and Talbot, T. O. (2003) Positional error in automated geocoding of residential addresses. International Journal of Health Geographics, 2, 1–10.
  • \n", "
  • Christen, P. (2012) Data matching: Concepts and techniques for record linkage, entity resolution, and duplicate detection. New York, NY: Springer.
  • \n", "
  • Comber, S. and Arribas‐Bel, D. (2019) Machine learning innovations in address matching: A practical comparison of word2vec and CRFs. Transactions in GIS, 23, 334–348.
  • \n", "
  • Damerau, F. (1964) A technique for computer detection and correction of spelling errors. Communications of the ACM, 7, 171–176.
  • \n", "
  • Diesner, J. and Carley, K. M. (2008) Conditional random fields for entity extraction and ontological text coding.\n", "Computational and Mathematical Organization Theory, 14, 248–262.
  • \n", "
  • Lafferty, J., McCallum, A. and Pereira, F. (2001) Conditional random fields: Probabilistic models for segmenting\n", "and labelling sequence data. In C. E. Brodley and A. P. Danyluk (Eds.), Proceedings of the 18th International Conference on\n", "Machine Learning (pp. 282–289). San Francisco, CA: Morgan Kaufmann.
  • \n", "
  • Reynolds, P., Behren, J. V., Gunier, R., Goldberg, D., Hertz, A., and Smith, D. (2003) Childhood cancer incidence rates and hazardous air pollutants in California: An exploratory analysis. Environmental Health Perspectives, 111, 663–668.
  • \n", "
  • Ruggles, S., Fitch, C. and Roberts, E. (2018) Historical Census Record Linkage. Annual Review of Sociology, 44(1), 19–37.
  • \n", "
  • Yancey, W. (2005) Evaluating string comparator performance for record linkage (Research Report Series, Statistics\n", "#2005‐05). Washington, DC: U.S. Bureau of the Census.
  • \n", "
" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 4 }