Intermediate · Natural Language Processing

Learn about Named Entity Recognition (NER), the NLP task of identifying and classifying named entities in text.

Tags: ner · nlp · information-extraction · sequence-labeling

Named Entity Recognition

Named Entity Recognition (NER) is the task of identifying and classifying named entities (people, organizations, locations, etc.) in unstructured text.

What are Named Entities?

Named entities are real-world objects that can be denoted with a proper name:

"Apple announced that Tim Cook will visit Paris next Monday."
  ↓                    ↓                   ↓         ↓
 ORG                 PERSON               LOC       DATE

Common Entity Types

| Type       | Description     | Examples                     |
|------------|-----------------|------------------------------|
| PER/PERSON | People          | Tim Cook, Einstein           |
| ORG        | Organizations   | Apple, UN, MIT               |
| LOC/GPE    | Locations       | Paris, France, Mount Everest |
| DATE       | Dates/times     | Monday, 2024, next week      |
| MONEY      | Monetary values | $100, 50 euros               |
| PERCENT    | Percentages     | 20%, five percent            |
| PRODUCT    | Products        | iPhone, Windows 11           |
| EVENT      | Events          | World Cup, Olympics          |

Tagging Schemes

BIO Tagging (Most Common)

B = Beginning of entity
I = Inside entity (continuation)
O = Outside (not an entity)

"Tim Cook visited Apple headquarters"
 Tim   → B-PER
 Cook  → I-PER
 visited → O
 Apple → B-ORG
 headquarters → O
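Decoding a BIO tag sequence back into entity spans is a common post-processing step. A minimal sketch (the helper name `bio_to_spans` is hypothetical):

```python
def bio_to_spans(tokens, tags):
    """Decode parallel token/BIO-tag lists into (entity_text, entity_type) pairs."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = ([token], tag[2:])  # start a new entity
        elif tag.startswith("I-") and current and tag[2:] == current[1]:
            current[0].append(token)      # continue the open entity
        else:  # "O", or an I- tag that does not continue the open entity
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(" ".join(toks), label) for toks, label in spans]

tokens = ["Tim", "Cook", "visited", "Apple", "headquarters"]
tags = ["B-PER", "I-PER", "O", "B-ORG", "O"]
print(bio_to_spans(tokens, tags))  # [('Tim Cook', 'PER'), ('Apple', 'ORG')]
```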

BIOES Tagging

B = Beginning
I = Inside
O = Outside
E = End of entity
S = Single-token entity

"New York" → B-LOC, E-LOC
"Paris"    → S-LOC

Traditional Approaches

Rule-Based

import re

patterns = {
    'EMAIL': r'\b[\w.-]+@[\w.-]+\.\w+\b',
    'PHONE': r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b',
    'DATE': r'\b\d{1,2}/\d{1,2}/\d{2,4}\b'
}

def extract_entities(text):
    entities = []
    for entity_type, pattern in patterns.items():
        for match in re.finditer(pattern, text):
            entities.append((match.group(), entity_type, match.span()))
    return entities

CRF (Conditional Random Fields)

Features:
- Current word
- Previous/next word
- Is capitalized?
- Word shape (Xxxx, XXXX, xxxx)
- POS tag
- Prefix/suffix
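The feature list above can be expressed as a per-token feature dictionary, the input format libraries such as sklearn-crfsuite expect. A sketch (function name and feature keys are illustrative, not a fixed API):

```python
import re

def word2features(tokens, i):
    """Build a CRF feature dict for token i from the features listed above."""
    word = tokens[i]
    return {
        "word": word.lower(),
        "is_capitalized": word[:1].isupper(),
        # Word shape: uppercase -> X, lowercase -> x, digit -> d
        "shape": re.sub(r"[A-Z]", "X", re.sub(r"[a-z]", "x", re.sub(r"\d", "d", word))),
        "prefix2": word[:2],
        "suffix2": word[-2:],
        "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

print(word2features(["Tim", "Cook", "visited"], 0)["shape"])  # Xxx
```

(POS tags would typically come from a separate tagger and be added as another key.)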

Deep Learning Approaches

BiLSTM-CRF

Input:    Tim    Cook    visited    Apple
           ↓      ↓        ↓         ↓
Embed:   [e1]   [e2]     [e3]      [e4]
           ↓      ↓        ↓         ↓
BiLSTM:  [h1]   [h2]     [h3]      [h4]
           ↓      ↓        ↓         ↓
CRF:    B-PER  I-PER      O       B-ORG

Transformer-Based (BERT)

from transformers import pipeline

ner = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english")

text = "Tim Cook announced new products at Apple Park"
entities = ner(text)

# Output:
# [{'entity': 'B-PER', 'word': 'Tim', 'score': 0.99},
#  {'entity': 'I-PER', 'word': 'Cook', 'score': 0.99},
#  {'entity': 'B-ORG', 'word': 'Apple', 'score': 0.98}]

SpaCy Implementation

import spacy

nlp = spacy.load("en_core_web_lg")

text = "Apple CEO Tim Cook met with President Biden in Washington."
doc = nlp(text)

for ent in doc.ents:
    print(f"{ent.text:20} {ent.label_:10} {ent.start_char}-{ent.end_char}")

# Output:
# Apple                ORG        0-5
# Tim Cook             PERSON     10-18
# Biden                PERSON     38-43
# Washington           GPE        47-57

Training Custom NER

Data Format (CoNLL)

Tim    B-PER
Cook   I-PER
works  O
at     O
Apple  B-ORG
.      O

She    O
lives  O
in     O
Paris  B-LOC
.      O
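A loader for this format is short: tokens and tags are whitespace-separated within a line, and a blank line ends a sentence. A minimal sketch (the `read_conll` name is hypothetical):

```python
def read_conll(text):
    """Parse 'token<space>tag' lines into a list of (tokens, tags) sentences."""
    sentences, tokens, tags = [], [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:  # blank line separates sentences
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
        else:
            token, tag = line.split()
            tokens.append(token)
            tags.append(tag)
    if tokens:  # flush the last sentence if the file lacks a trailing blank line
        sentences.append((tokens, tags))
    return sentences

data = "Tim B-PER\nCook I-PER\nworks O\n\nParis B-LOC\n"
print(len(read_conll(data)))  # 2 sentences
```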

Fine-tuning BERT for NER

from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    TrainingArguments,
    Trainer,
    DataCollatorForTokenClassification
)

# Load pre-trained model
model_name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(label_list)  # B-PER, I-PER, B-ORG, etc.
)

# Tokenize and align labels
def tokenize_and_align_labels(examples):
    tokenized = tokenizer(
        examples["tokens"],
        truncation=True,
        is_split_into_words=True
    )
    
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        label_ids = []
        for word_id in word_ids:
            if word_id is None:
                label_ids.append(-100)  # Special tokens ([CLS], [SEP]): ignored by the loss
            else:
                label_ids.append(label[word_id])  # Each subword inherits its word's label
        labels.append(label_ids)
    
    tokenized["labels"] = labels
    return tokenized

Evaluation Metrics

Entity-Level Metrics

Predicted: [Tim Cook](PER) visited [Apple](ORG)
Gold:      [Tim Cook](PER) visited [Apple Inc.](ORG)

Precision = Correct predictions / All predictions
Recall = Correct predictions / All gold entities
F1 = 2 × (P × R) / (P + R)
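In practice the seqeval library computes these entity-level scores from tag sequences, but the definitions above reduce to set operations over (start, end, label) spans. A sketch:

```python
def entity_f1(predicted, gold):
    """Strict entity-level precision, recall, and F1 over (start, end, label) spans."""
    pred_set, gold_set = set(predicted), set(gold)
    correct = len(pred_set & gold_set)  # exact span + label matches
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# The example above: "Tim Cook" matches exactly, but the predicted "Apple"
# and gold "Apple Inc." have different boundaries, so strict matching rejects it.
pred = [(0, 2, "PER"), (3, 4, "ORG")]
gold = [(0, 2, "PER"), (3, 5, "ORG")]
print(entity_f1(pred, gold))  # (0.5, 0.5, 0.5)
```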

Strict vs Relaxed Matching

| Matching  | Boundary        | Label       |
|-----------|-----------------|-------------|
| Strict    | Exact match     | Exact match |
| Relaxed   | Partial overlap | Exact match |
| Type only | Any overlap     | Exact match |
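The matching modes differ only in how boundaries are compared. A simplified sketch that treats both relaxed variants as "any overlap" (hypothetical helper names):

```python
def spans_overlap(a, b):
    """True if two (start, end, label) spans overlap at all (end is exclusive)."""
    return a[0] < b[1] and b[0] < a[1]

def match(pred, gold, mode="strict"):
    """Compare one predicted span to one gold span under a matching mode."""
    if pred[2] != gold[2]:               # the label must match in every mode
        return False
    if mode == "strict":
        return pred[:2] == gold[:2]      # boundaries must match exactly
    return spans_overlap(pred, gold)     # relaxed: any boundary overlap counts

pred = (3, 4, "ORG")  # predicted: [Apple]
gold = (3, 5, "ORG")  # gold:      [Apple Inc.]
print(match(pred, gold, "strict"))   # False
print(match(pred, gold, "relaxed"))  # True
```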

Challenges

1. Ambiguity

"Apple" → Company or fruit?
"Jordan" → Person or country?
"Washington" → Person, state, or city?

2. Nested Entities

"Bank of America" → [Bank of [America](LOC)](ORG)

3. Domain Shift

A model trained on general news text often transfers poorly to other domains:
- Medical texts (diseases, drugs)
- Legal documents (laws, courts)
- Scientific papers (genes, proteins)

Applications

  1. Information Extraction: Extract structured data from text
  2. Question Answering: Identify entity mentions in queries
  3. Search: Entity-based indexing and retrieval
  4. Knowledge Graphs: Populate nodes from text
  5. Content Recommendation: Tag content by entities
  6. Anonymization: Redact PII (names, addresses)

Key Takeaways

  1. NER identifies and classifies named entities in text
  2. BIO tagging is the standard labeling scheme
  3. Transformer models (BERT) achieve state-of-the-art results
  4. Evaluate at entity level, not token level
  5. Domain-specific NER often requires custom training
  6. Handle ambiguity through context and fine-tuning