Introduction

This article discusses how word embeddings can help map medical terminologies, specifically MedDRA terms to ICD codes, and the challenges involved. Word embeddings represent words as numerical vectors derived from their contextual usage in a text corpus. We cover text pre-processing and the algorithms used to generate the embeddings, evaluate several embedding models against a baseline approach, and close with the potential applications and limitations of word embeddings for terminology mapping.

Background

Drug safety science addresses questions about drug safety measures and adverse events. A common question is the baseline rate of a given event in a specific patient population. Safety scientists calculate these incidence rates from real-world data sources such as insurance claims databases and patient registries. A significant challenge is that these databases use different coding vocabularies than the safety scientists do: the scientists work with MedDRA codes, while the databases are encoded in ICD-9 and ICD-10. Mapping MedDRA terms to ICD codes by hand is labor-intensive, because many MedDRA terms have no direct or approximate mapping to ICD codes.

The Problem

The disparity between MedDRA codes and ICD codes makes it hard to map medical terminologies accurately. Safety scientists often have to map hundreds of MedDRA terms to ICD codes by hand, which is time-consuming and error-prone. Existing mapping tables cover only around 30-40% of MedDRA terms, leaving the majority without a direct or approximate mapping. This gap hampers the analysis and interpretation of adverse event data, so an efficient and accurate way to map MedDRA terms to ICD codes is needed.

Word Embeddings

Word embeddings offer a potential solution to the problem of mapping medical terminologies. Because they represent words as numerical vectors based on context, word embeddings capture semantic relationships between words: words used in similar contexts end up with similar vectors. Common algorithms for generating word embeddings include word2vec (in its two variants, skip-gram and CBOW), fastText, and GloVe. These algorithms learn a numerical description of each word from its surrounding context in a text corpus. The resulting embeddings can serve several purposes, including input to machine learning models, calculating distances between words, and mapping terminologies.
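To make "calculating distances between words" concrete, here is a minimal sketch using cosine similarity, one common choice of similarity measure for embeddings. The three-dimensional vectors and the term names are invented for illustration; real embeddings typically have hundreds of dimensions and are learned from a corpus.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for the zero vector, which has no direction
    return dot / (norm_u * norm_v)

def nearest(query, vectors, k=3):
    """Return up to k terms most similar to `query`, best match first."""
    scored = [
        (cosine_similarity(vectors[query], vec), name)
        for name, vec in vectors.items()
        if name != query
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

# Hypothetical toy vectors, not taken from any trained model.
toy_vectors = {
    "myocardial_infarction": [0.9, 0.1, 0.0],
    "heart_attack": [0.8, 0.2, 0.1],
    "headache": [0.0, 0.1, 0.9],
}

print(nearest("myocardial_infarction", toy_vectors, k=2))
```

In this toy setup the two cardiac terms point in nearly the same direction, so they rank as each other's nearest neighbour, which is the property a terminology mapping would rely on.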

Text Pre-processing

Before generating word embeddings, the text must be pre-processed to clean and prepare it. Common steps include removing stop words (such as "and" and "the"), punctuation marks, and special characters. The formatting of MedDRA and ICD terms also needs attention to ensure accurate mapping. For example, spaces in multi-word MedDRA terms can be replaced with underscores so that each term is treated as a single word during embedding.
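The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the work described here: the stop-word list is deliberately tiny, and the function names (`join_terms`, `preprocess`) and the sample sentence are assumptions made for the example.

```python
import re

# A tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"and", "the", "of", "a", "an", "in", "to", "with"}

def join_terms(text, terms):
    """Replace each known multi-word term with an underscore-joined token."""
    # Longest terms first, so a longer term is joined before any shorter
    # term contained in it gets a chance to match.
    for term in sorted(terms, key=len, reverse=True):
        joined = term.replace(" ", "_")
        pattern = r"\b" + re.escape(term) + r"\b"
        text = re.sub(pattern, lambda match: joined, text)
    return text

def preprocess(text, terms):
    """Lowercase, join multi-word terms, strip punctuation, drop stop words."""
    text = text.lower()
    text = join_terms(text, [t.lower() for t in terms])
    # Keep only lowercase letters, digits, underscores, and whitespace.
    text = re.sub(r"[^a-z0-9_\s]", " ", text)
    return [tok for tok in text.split() if tok not in STOP_WORDS]

tokens = preprocess(
    "Patient had Myocardial Infarction, and the headache.",
    ["Myocardial infarction"],
)
print(tokens)
```

Joining the terms before stripping punctuation keeps the underscore tokens intact. Terms that themselves contain punctuation, such as hyphens, would need extra handling that this sketch leaves out.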

Generating Word Embeddings

To generate word embeddings for mapping MedDRA terms to ICD codes, we used two text corpora: the Roche drug safety database and the publicly available MIMIC-III database. The Roche database contains millions of drug safety reports, while MIMIC-III consists of patient notes from a US hospital. We experimented with different algorithms and parameter combinations to generate an embedding for each MedDRA term. The algorithms train neural networks to predict either the word of interest from its context or the context from the word, which yields a descriptive vector for each word.
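The two prediction tasks mentioned above correspond to the skip-gram and CBOW variants of word2vec. The sketch below only builds the training examples each variant would see from a tokenized sentence; it does not train a network. The fixed window size and the function names are assumptions for illustration, and real implementations add refinements such as subsampling of frequent words.

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: skip-gram predicts each context word from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """(context words, center) examples: CBOW predicts the center word from its context."""
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:  # skip single-token inputs, which have no context
            examples.append((context, center))
    return examples

# Hypothetical pre-processed sentence.
sentence = ["patient", "reported", "myocardial_infarction", "yesterday"]

for pair in skipgram_pairs(sentence, window=1):
    print(pair)
```

Because an underscore-joined term such as `myocardial_infarction` is a single token, it receives its own vector, shaped by the words that appear around it in the corpus.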

Performance Evaluation

To evaluate how well word embeddings map MedDRA terms to ICD codes, we compared their results to a simple baseline. The baseline scored a pair of terms by the overlap of the documents containing each term. We selected a table of 524 MedDRA terms with known mappings to both ICD-9 and ICD-10 codes. For each MedDRA term, we computed distances to the ICD-9 and ICD-10 terms and kept the 50 closest. We then measured recall as the share of known true mappings that appeared within the top 50. Word embeddings, particularly those from the fastText and GloVe algorithms, outperformed the baseline in nearly all cases. Recall for ICD-9 mappings was consistently around 90%, which indicates the potential of word embeddings for mapping medical terminologies.
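The evaluation can be sketched as a top-k recall computation that accepts any scoring function, so the same code serves both an embedding-based score and a document-overlap baseline. The text does not say how the overlap was computed; Jaccard overlap is used here as an assumption. The document index, the term labels (such as `icd_acute_mi`), and the known mappings are all hypothetical toy data.

```python
def jaccard(docs_a, docs_b):
    """Jaccard overlap of two sets of document ids (an assumed overlap measure)."""
    union = docs_a | docs_b
    return len(docs_a & docs_b) / len(union) if union else 0.0

def top_k(query, candidates, score, k=50):
    """Return the k candidates with the highest score against the query."""
    ranked = sorted(candidates, key=lambda c: score(query, c), reverse=True)
    return ranked[:k]

def recall_at_k(known_mappings, candidates, score, k=50):
    """Share of known (source, target) pairs whose target is in the source's top k."""
    hits = 0
    total = 0
    for source, targets in known_mappings.items():
        shortlist = set(top_k(source, candidates, score, k))
        for target in targets:
            total += 1
            if target in shortlist:
                hits += 1
    return hits / total if total else 0.0

# Hypothetical toy index: term -> ids of the documents that mention it.
doc_index = {
    "myocardial_infarction": {1, 2, 3},
    "headache": {5, 7},
    "nausea": {8},
    "icd_acute_mi": {2, 3, 4},
    "icd_migraine": {5, 6},
    "icd_vomiting": {9},
}
ICD_CANDIDATES = ["icd_acute_mi", "icd_migraine", "icd_vomiting"]
KNOWN = {
    "myocardial_infarction": ["icd_acute_mi"],
    "headache": ["icd_migraine"],
    "nausea": ["icd_vomiting"],
}

def overlap_score(term_a, term_b):
    """Baseline score: overlap of the documents mentioning each term."""
    return jaccard(doc_index[term_a], doc_index[term_b])

print(recall_at_k(KNOWN, ICD_CANDIDATES, overlap_score, k=1))
```

To evaluate an embedding model instead, one would pass a function that returns the similarity between the two terms' vectors as `score`. In the toy data, "nausea" shares no documents with any candidate, which illustrates one way a document-overlap baseline can fail to rank the correct code first.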

Conclusion

In conclusion, word embeddings offer a promising approach to mapping medical terminologies, specifically MedDRA terms to ICD codes. Because they represent words as numerical vectors based on contextual usage, they capture semantic relationships that can guide the mapping. In our experiments, word embeddings, particularly those from the fastText and GloVe algorithms, outperformed a simple baseline approach. These results should still be interpreted cautiously, taking into account how recall was measured and how much mapping information is available. The goal going forward is to develop a company-wide tool that supports the mapping of medical terminologies, improving efficiency and reducing manual effort.

Efficiently Mapping Medical Vocabulary with Word Embeddings