Merriam-Webster and Unstructured Data Processing

2025/11/14 09:30

I recently finished reading Word by Word: The Secret Life of Dictionaries by Kory Stamper, which was an unexpected page-turner. What intrigued me most was (perhaps unsurprisingly) Stamper’s description of how Merriam-Webster gets written, and how strikingly that process resembles many successful unstructured data projects in the wild. I want to use this blog post to ruminate on this.

First, it begins with the collection and curation of raw, unstructured data. Stamper describes a fascinating process called “reading and marking”, whereby editors are assigned reading of current magazines, periodicals, and blogs (almost anything written in English, it seems) and underline any words that catch their eye: new words, or words that get used in new ways. (This is, contrary to first impressions, a non-trivial task that requires training: good readers-and-markers will pick up on the recent trend of “bored of”, instead of the more historically common “bored with”. This doesn’t imply that bored is picking up a new meaning, but rather that of is… which, as you can imagine, can get lexicographers very excited.)

George Ho

Source: https://www.georgeho.org/webster-unstructured-data/

Professor

Roboko, today's IT news is about how Merriam-Webster writes its dictionary.

Roboko

Dictionary writing? That is intriguing. How does it relate to IT?

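Professor

Well, making a dictionary can be viewed as a data project too. According to the article, editors first read magazines and blogs and do "reading and marking" of new words and new usages.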

Roboko

I see, so that is the stage of collecting and curating raw, unstructured data.

Professor

Exactly! And apparently they also use corpora, such as tweets and TV show transcripts. Those are massive datasets.
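
As a rough illustration of why corpora help, the "bored of" versus "bored with" trend from the excerpt could be checked by counting each phrase across a corpus. The sketch below is a toy under stated assumptions: the `corpus` sentences and the `count_phrases` helper are hypothetical, and nothing here is described in the article as Merriam-Webster's actual tooling.

```python
import re
from collections import Counter

def count_phrases(texts, phrases):
    """Count case-insensitive whole-phrase matches across a list of texts."""
    counts = Counter({phrase: 0 for phrase in phrases})
    for phrase in phrases:
        pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
        for text in texts:
            counts[phrase] += len(pattern.findall(text))
    return counts

# Hypothetical mini-corpus standing in for tweets or TV transcripts.
corpus = [
    "I'm so bored of this show.",
    "She was bored with the lecture.",
    "Bored of waiting, he left.",
]

counts = count_phrases(corpus, ["bored of", "bored with"])
print(counts["bored of"], counts["bored with"])
```

On a real corpus the counts would normally be bucketed by year, so that a rising share of "bored of" shows up as a trend rather than a single number.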

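Roboko

Next, the editors divide the whole dictionary among themselves and define each word by hand. They open the database and either revise an existing definition or write (or rewrite) the definition for a new word.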

Professor

That's right! They reportedly spend about 15 minutes per word on average. It is a daunting amount of work.

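Roboko

That is the stage of adding value to the structured data. And finally, they offer features and datasets on top of the existing data, such as etymologies, pronunciations, and dates.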

Professor

Exactly! The article gives a recipe for a successful data project: collect raw, unstructured data, give it structure, and provide ancillary datasets.

Roboko

Google Search is cited as an example too. It crawled the internet, invented PageRank, and made search possible.

Professor

PageRank was a remarkable invention. Question answering and carousels are good examples of ancillary features added on top of the core offering.
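
For readers who have not seen it, the idea behind PageRank can be sketched as a power iteration over a link graph. This is a minimal textbook-style version run on a hypothetical three-page graph, with the commonly cited damping factor of 0.85 taken as an assumption; it is not a description of Google's production system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict mapping page -> list of outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical toy web: "a" and "b" both link to "c", and "c" links back to "a".
toy_web = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy_web)
print(max(ranks, key=ranks.get))
```

The page that collects the most incoming rank ends up on top, which is the structure that the crawl alone does not provide.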

Roboko

A site called cryptics.georgeho.org is also introduced. The author indexed blogs about cryptic crosswords and wrote code to parse the clue information.

Professor

Hmm, HTML parsing with BeautifulSoup. Painstaking work. But that is what made it a valuable resource for people who write cryptic crosswords.
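
The kind of clue parsing mentioned here can be sketched in a few lines. The HTML fragment, the `clue` class name, and the `ClueParser` class are all hypothetical, since the article does not show the real blog markup. The sketch also swaps in Python's standard-library `html.parser` for BeautifulSoup so that it runs without extra dependencies.

```python
from html.parser import HTMLParser

class ClueParser(HTMLParser):
    """Collect the text of every <li class="clue"> element (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.clues = []
        self._in_clue = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "li" and ("class", "clue") in attrs:
            self._in_clue = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_clue:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "li" and self._in_clue:
            self.clues.append("".join(self._buffer).strip())
            self._in_clue = False

# Made-up blog fragment; real crossword blogs each use their own layout.
html = """
<ul>
  <li class="clue">1a Example clue text (5)</li>
  <li class="note">Not a clue</li>
  <li class="clue">2d Another <em>example</em> clue (7)</li>
</ul>
"""

parser = ClueParser()
parser.feed(html)
print(parser.clues)
```

Each blog would likely need its own variant of this parser, which is where the painstaking part comes from.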

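Roboko

It is also interesting that, for a word with multiple definitions, the definitions are listed in chronological order of first use rather than by "importance".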

Professor

Indeed. Words are living things, and their meanings shift with the times. A dictionary is like a time capsule that records those shifts.
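
In data terms, ordering senses by first use is a sort on a date field instead of an importance score. The entries below are invented placeholders (the `Sense` class, the years, and the `importance` values are all hypothetical), meant only to show the two orderings side by side.

```python
from dataclasses import dataclass

@dataclass
class Sense:
    definition: str
    first_use_year: int   # hypothetical year of first recorded use
    importance: float     # hypothetical editorial or frequency score

# Placeholder senses for an imaginary headword; not real dictionary data.
senses = [
    Sense("newer, very common sense", first_use_year=1950, importance=0.9),
    Sense("oldest, now rare sense", first_use_year=1600, importance=0.1),
    Sense("middle sense", first_use_year=1800, importance=0.5),
]

# Dictionary-style ordering: earliest first use comes first.
chronological = sorted(senses, key=lambda s: s.first_use_year)
# Alternative ordering: highest importance score comes first.
by_importance = sorted(senses, key=lambda s: s.importance, reverse=True)

print([s.definition for s in chronological])
print([s.definition for s in by_importance])
```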

Roboko

Today's news taught me the basic flow of a data project: collect raw data, structure it, and add value.

Professor

Exactly! Someday, Roboko, you should launch a data project that changes the world too!

Roboko

Yes, I'll do my best! By the way, Professor, what is your favorite word in the dictionary?

Professor

Let me see... "Nap", I suppose. ...I think I'll take one right now...

⚠️ This article includes content produced by generative AI and may contain hallucinations.
