Theta Lake has been granted United States patent US 12,045,561, “System and Method for Disambiguating Data to Improve Analysis of Electronic Content,” covering TranscriptionRN®, a novel framework for automatically generating and ranking sound-alike and look-alike terms to improve the analysis of electronic communications.
Connecting Innovation Across the Patent Portfolio
The patent intersects with and supports a broader portfolio of patents that together define our approach to oversight of digital communications. That portfolio ranges from our foundational patent covering context-based policy detection across what is spoken, shown, and shared in video communications, to patents covering participant disambiguation and AI-assisted review workflows. This patent adds to it by helping systems recognize what was really said.
TranscriptionRN® technology applies deep linguistic and phonetic intelligence – including morphological analysis, soundalike recognition, and semantic mapping – to generate more accurate, context-rich transcripts. This enables compliance teams to find, understand, and act on risks faster and with greater precision.
When “Interest Rate” Becomes “Pinterest Rite”
Electronic communications data is messy. Transcripts produced by automated speech recognition (ASR) systems routinely confuse similar-sounding terms, misplace word boundaries, and introduce phonetic and spelling errors. Chat messages typed quickly on platforms like Slack or Microsoft Teams contain typos, abbreviations, and shorthand. Text extracted from shared screens and documents by optical character recognition (OCR) systems can contain character-level distortions. A system might render “litecoin” as “light coin,” “late fees” as “ladies,” or “interest rate” as “pinterest rite.” These aren’t edge cases; they are the baseline reality of working with communications data at scale.
For any system that relies on identifying specific terms in this data, whether for regulatory compliance, privacy, or cybersecurity, these errors represent a fundamental challenge. The terms are present in the original conversation, but they may not be present in the data as transcribed, typed, or extracted.
What the Patent Covers
Patent 12,045,561 describes TranscriptionRN®, a framework that takes a set of domain-relevant keywords and key phrases and automatically generates a comprehensive, ranked list of sound-alike and look-alike variants: a range of plausible ways those terms might appear in imperfect data. The output is a structured resource: a ranked inventory of candidate terms that downstream systems can use to improve how they analyze communications content.
Inside TranscriptionRN®: Two Stages of Intelligence
TranscriptionRN® operates in two stages. In the first, compound words from the input keywords are split into their constituent parts to create a set of seed words. This splitting is achieved using both a morphological analyzer, which breaks words down based on linguistic structure, and a phonetic encoding algorithm, which identifies how words might be decomposed based on how they sound. For example, “litecoin” might be broken into “lite” and “coin,” while “payable” might be decomposed into “pay” and “able.” Non-compound words from the input keywords are added to the seed word collection as-is, so both decomposed components and intact words contribute to the final seed word set.
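The first stage can be pictured with a minimal Python sketch. It is an illustration only, not the patented method: a small, hypothetical vocabulary lookup stands in for the morphological analyzer and phonetic encoder, and the names VOCAB, split_compound, and seed_words are assumptions made for this example. Splitting multi-word phrases on whitespace is also an assumption.

```python
from typing import Iterable, List, Set

# Hypothetical toy vocabulary. A real system would rely on a morphological
# analyzer and a phonetic encoding algorithm instead of a fixed word set.
VOCAB: Set[str] = {"lite", "coin", "pay", "able", "late", "fees", "interest", "rate"}


def split_compound(word: str, vocab: Set[str] = VOCAB) -> List[str]:
    """Return two vocabulary parts if the word splits cleanly, else the word itself."""
    for i in range(2, len(word) - 1):
        head, tail = word[:i], word[i:]
        if head in vocab and tail in vocab:
            return [head, tail]
    return [word]  # non-compound words are kept as-is


def seed_words(keywords: Iterable[str]) -> List[str]:
    """Build an ordered, de-duplicated seed list from keywords and key phrases."""
    seeds: List[str] = []
    for phrase in keywords:
        for token in phrase.lower().split():
            for part in split_compound(token):
                if part not in seeds:
                    seeds.append(part)
    return seeds
```

Under these assumptions, seed_words(["litecoin", "payable", "late fees"]) yields the decomposed parts of the two compounds followed by the intact words “late” and “fees.”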
In the second stage, the system generates new sound-alike and look-alike candidates for each seed word by combining the output of three approaches: a spelling correction algorithm that identifies words within a defined edit distance, a word formation module that generates grammatical inflections and derivations, and a novel Look-Alike Sound-Alike (“LASA”) generator. The LASA generator is a new algorithm invented by Theta Lake that blends consecutive words together using word formation grammar rules to produce candidates that could plausibly be confused with the original phrase. Starting from “late fees,” for example, the system generates candidates like “ladies” and “layers” — terms that an ASR system might realistically produce by combining phonetic confusion with word boundary shifts.
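Of the three approaches, the edit-distance component is the easiest to illustrate. The sketch below uses a standard Levenshtein distance and a tiny, hypothetical lexicon (LEXICON) to find words within a chosen distance of a seed word. The function names and the lexicon are assumptions for illustration. The word formation module and the LASA generator are not reproduced here.

```python
from typing import List, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# Hypothetical lexicon; a real system would draw on a full dictionary.
LEXICON: List[str] = ["late", "lake", "rate", "lite", "light",
                      "fees", "feet", "coin", "corn"]


def within_distance(seed: str, lexicon: Sequence[str] = LEXICON,
                    max_dist: int = 1) -> List[str]:
    """Return lexicon words, other than the seed, within max_dist edits of it."""
    return [w for w in lexicon
            if w != seed and edit_distance(seed, w) <= max_dist]
```

With this toy lexicon, within_distance("late") returns “lake,” “rate,” and “lite,” each one substitution away from the seed. Raising max_dist widens the net at the cost of more noise, which is one reason a later ranking step matters.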
The LASA Difference: Learning from How We Speak
All generated candidates are then scored and ranked using a formula that accounts for phonetic similarity to the original term, frequency in real-world spoken language, and grammatical plausibility. This ranking is critical: it enables downstream systems to apply these lists flexibly, using high-confidence candidates differently from lower-confidence ones depending on the task.
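The scoring formula itself is not given here, so the following Python sketch should be read as one plausible shape rather than the patented formula. It assumes each candidate already carries three feature values normalized to the range 0 to 1 and combines them with a weighted sum. The Candidate class, the weights, and the function names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Candidate:
    term: str
    phonetic_similarity: float       # assumed normalized to 0..1
    spoken_frequency: float          # assumed normalized to 0..1
    grammatical_plausibility: float  # assumed normalized to 0..1


# Hypothetical weights, chosen only for illustration.
W_PHONETIC, W_FREQUENCY, W_GRAMMAR = 0.5, 0.3, 0.2


def score(c: Candidate) -> float:
    """Weighted sum of the three signals (an assumed form, not the patented one)."""
    return (W_PHONETIC * c.phonetic_similarity
            + W_FREQUENCY * c.spoken_frequency
            + W_GRAMMAR * c.grammatical_plausibility)


def rank(candidates: Iterable[Candidate]) -> List[Candidate]:
    """Order candidates from highest to lowest score."""
    return sorted(candidates, key=score, reverse=True)
```

A linear combination keeps each signal's contribution easy to inspect, and a downstream consumer can apply its own cut-off to the ranked list depending on how much noise it can tolerate.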
TranscriptionRN® can also be tailored to any domain by providing a word frequency list computed from conversations in that industry. A frequency list built from financial services conversations will yield sound-alike and look-alike candidates tuned to the vocabulary of finance; one built from healthcare or technology conversations will reflect the terminology of those fields. This domain adaptability means the system can serve different industries and use cases without retraining an underlying ASR model.
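As a rough illustration of what a domain word frequency list might look like, the sketch below counts tokens across a collection of conversation texts and converts the counts to relative frequencies. The tokenization rule and the function name frequency_list are assumptions. The format TranscriptionRN® actually consumes is not specified here.

```python
import re
from collections import Counter
from typing import Dict, Iterable


def frequency_list(conversations: Iterable[str]) -> Dict[str, float]:
    """Map each word to its relative frequency across the given texts."""
    counts: Counter = Counter()
    for text in conversations:
        # Simple tokenization (an assumption): lowercase runs of letters
        # and apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}
```

Swapping in conversations from a different industry changes the frequencies, and therefore which candidates a frequency-aware ranking would favor, without touching any speech recognition model.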
Why Starting with the Right Terms Matters
The quality of any sound-alike and look-alike generation system depends on the quality of the terms it starts with. A comprehensive list of domain-relevant keywords, reflecting actual regulatory requirements, industry-specific terminology, and the language patterns associated with real compliance risks, is essential for generating candidates that matter.
This is where Theta Lake’s deep regulatory knowledge and linguistic expertise come into play. The keyword sets used by Theta Lake’s classifiers are informed by both industry guidance and the practical realities of how compliance risks manifest in everyday conversations. The patented framework described here takes those carefully constructed inputs and expands them into a far broader net of variants, one that would be nearly impossible to build by hand given the sheer number of ways that any term can be distorted across speech, text, and OCR.
Powering Smarter Risk Detection Downstream
The sound-alikes and look-alikes generated by TranscriptionRN® can be used to fine-tune automated speech recognition error correction models and to facilitate downstream natural language processing tasks. At Theta Lake, those downstream tasks are our AI-driven risk classifiers, which analyze communications to detect regulatory, compliance, privacy, cybersecurity, and HR risks.
Theta Lake’s classifiers are purpose-built to leverage ranked sound-alike and look-alike lists as part of their detection logic. The classifiers incorporate the ranked candidates generated by this framework to identify risk-relevant language even when the underlying data is imperfect. The result is broader, more precise risk detection: the variants widen coverage, while the ranking reduces noise and focuses attention where it matters most.
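To make the idea of consuming a ranked list concrete, here is a deliberately simple Python sketch. It is not Theta Lake’s classifier logic. It assumes a hypothetical VARIANTS table with placeholder scores and flags any variant in a text whose score meets a minimum threshold.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical ranked variants with placeholder scores, for illustration only.
VARIANTS: Dict[str, List[Tuple[str, float]]] = {
    "late fees": [("late fees", 1.0), ("ladies", 0.76), ("layers", 0.60)],
}


def find_hits(text: str,
              variants: Dict[str, List[Tuple[str, float]]] = VARIANTS,
              min_score: float = 0.7) -> List[Tuple[str, str, float]]:
    """Return (keyword, matched variant, score) for variants at or above min_score."""
    lowered = text.lower()
    hits: List[Tuple[str, str, float]] = []
    for keyword, ranked in variants.items():
        for variant, score in ranked:
            if score < min_score:
                continue  # skip lower-confidence variants for this task
            # Match whole words or phrases only.
            if re.search(r"\b" + re.escape(variant) + r"\b", lowered):
                hits.append((keyword, variant, score))
    return hits
```

In this sketch, a stricter threshold keeps only the higher-confidence variants, while a looser one trades precision for recall. A real classifier would also weigh the surrounding context before treating a match as a risk.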
Hearing What Was Really Said—at Scale
This integration is important because risk detection in modern communications is not a keyword-matching exercise. It requires understanding context, intent, and the many forms that a relevant term might take when spoken by different people, in different accents, in noisy environments, or typed hastily into a chat window. The technology described in this patent gives Theta Lake’s classifiers a structured, domain-tuned vocabulary of variants that makes that understanding possible.