Intermediate · Natural Language Processing

Learn about Named Entity Recognition (NER), the NLP task of identifying and classifying named entities in text.

Tags: ner · nlp · information-extraction · sequence-labeling

Named Entity Recognition

Named Entity Recognition (NER) is the task of identifying and classifying named entities (people, organizations, locations, etc.) in unstructured text.

What are Named Entities?

Named entities are real-world objects that can be denoted with a proper name:

"Apple announced that Tim Cook will visit Paris next Monday."
  ↓                    ↓                   ↓         ↓
 ORG                 PERSON               LOC       DATE

Common Entity Types

| Type       | Description     | Examples                     |
|------------|-----------------|------------------------------|
| PER/PERSON | People          | Tim Cook, Einstein           |
| ORG        | Organizations   | Apple, UN, MIT               |
| LOC/GPE    | Locations       | Paris, France, Mount Everest |
| DATE       | Dates/times     | Monday, 2024, next week      |
| MONEY      | Monetary values | $100, 50 euros               |
| PERCENT    | Percentages     | 20%, five percent            |
| PRODUCT    | Products        | iPhone, Windows 11           |
| EVENT      | Events          | World Cup, Olympics          |

Tagging Schemes

BIO Tagging (Most Common)

B = Beginning of entity
I = Inside entity (continuation)
O = Outside (not an entity)

"Tim Cook visited Apple headquarters"
 Tim   → B-PER
 Cook  → I-PER
 visited → O
 Apple → B-ORG
 headquarters → O
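Decoding a BIO tag sequence back into entity spans is a common post-processing step. A minimal sketch (the helper name `bio_to_spans` is hypothetical):

```python
def bio_to_spans(tokens, tags):
    """Decode parallel token/BIO-tag lists into (entity_text, entity_type) pairs."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = ([token], tag[2:])  # start a new entity
        elif tag.startswith("I-") and current and tag[2:] == current[1]:
            current[0].append(token)      # continue the open entity
        else:  # "O", or an I- tag that does not continue the open entity
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(" ".join(toks), label) for toks, label in spans]

tokens = ["Tim", "Cook", "visited", "Apple", "headquarters"]
tags = ["B-PER", "I-PER", "O", "B-ORG", "O"]
print(bio_to_spans(tokens, tags))  # [('Tim Cook', 'PER'), ('Apple', 'ORG')]
```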

BIOES Tagging

B = Beginning
I = Inside
O = Outside
E = End of entity
S = Single-token entity

"New York" → B-LOC, E-LOC
"Paris"    → S-LOC

Traditional Approaches

Rule-Based

import re

patterns = {
    'EMAIL': r'\b[\w.-]+@[\w.-]+\.\w+\b',
    'PHONE': r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b',
    'DATE': r'\b\d{1,2}/\d{1,2}/\d{2,4}\b'
}

def extract_entities(text):
    entities = []
    for entity_type, pattern in patterns.items():
        for match in re.finditer(pattern, text):
            entities.append((match.group(), entity_type, match.span()))
    return entities

CRF (Conditional Random Fields)

Features:
- Current word
- Previous/next word
- Is capitalized?
- Word shape (Xxxx, XXXX, xxxx)
- POS tag
- Prefix/suffix
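The feature list above can be expressed as a per-token feature dictionary, the input format libraries such as sklearn-crfsuite expect. A sketch (function name and feature keys are illustrative, not a fixed API):

```python
import re

def word2features(tokens, i):
    """Build a CRF feature dict for token i from the features listed above."""
    word = tokens[i]
    return {
        "word": word.lower(),
        "is_capitalized": word[:1].isupper(),
        # Word shape: uppercase -> X, lowercase -> x, digit -> d
        "shape": re.sub(r"[A-Z]", "X", re.sub(r"[a-z]", "x", re.sub(r"\d", "d", word))),
        "prefix2": word[:2],
        "suffix2": word[-2:],
        "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

print(word2features(["Tim", "Cook", "visited"], 0)["shape"])  # Xxx
```

(POS tags would typically come from a separate tagger and be added as another key.)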

Deep Learning Approaches

BiLSTM-CRF

Input:    Tim    Cook    visited    Apple
           ↓      ↓        ↓         ↓
Embed:   [e1]   [e2]     [e3]      [e4]
           ↓      ↓        ↓         ↓
BiLSTM:  [h1]   [h2]     [h3]      [h4]
           ↓      ↓        ↓         ↓
CRF:    B-PER  I-PER      O       B-ORG

Transformer-Based (BERT)

from transformers import pipeline

ner = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english")

text = "Tim Cook announced new products at Apple Park"
entities = ner(text)

# Output:
# [{'entity': 'B-PER', 'word': 'Tim', 'score': 0.99},
#  {'entity': 'I-PER', 'word': 'Cook', 'score': 0.99},
#  {'entity': 'B-ORG', 'word': 'Apple', 'score': 0.98}]

SpaCy Implementation

import spacy

nlp = spacy.load("en_core_web_lg")

text = "Apple CEO Tim Cook met with President Biden in Washington."
doc = nlp(text)

for ent in doc.ents:
    print(f"{ent.text:20} {ent.label_:10} {ent.start_char}-{ent.end_char}")

# Output:
# Apple                ORG        0-5
# Tim Cook             PERSON     10-18
# Biden                PERSON     38-43
# Washington           GPE        47-57

Training Custom NER

Data Format (CoNLL)

Tim    B-PER
Cook   I-PER
works  O
at     O
Apple  B-ORG
.      O

She    O
lives  O
in     O
Paris  B-LOC
.      O
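A loader for this format is short: tokens and tags are whitespace-separated within a line, and a blank line ends a sentence. A minimal sketch (the `read_conll` name is hypothetical):

```python
def read_conll(text):
    """Parse 'token<space>tag' lines into a list of (tokens, tags) sentences."""
    sentences, tokens, tags = [], [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:  # blank line separates sentences
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
        else:
            token, tag = line.split()
            tokens.append(token)
            tags.append(tag)
    if tokens:  # flush the last sentence if the file lacks a trailing blank line
        sentences.append((tokens, tags))
    return sentences

data = "Tim B-PER\nCook I-PER\nworks O\n\nParis B-LOC\n"
print(len(read_conll(data)))  # 2 sentences
```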

Fine-tuning BERT for NER

from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    TrainingArguments,
    Trainer,
    DataCollatorForTokenClassification
)

# Load pre-trained model
model_name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(label_list)  # B-PER, I-PER, B-ORG, etc.
)

# Tokenize and align labels
def tokenize_and_align_labels(examples):
    tokenized = tokenizer(
        examples["tokens"],
        truncation=True,
        is_split_into_words=True
    )
    
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        label_ids = []
        for word_id in word_ids:
            if word_id is None:
                label_ids.append(-100)  # Special tokens ([CLS], [SEP]): ignored by the loss
            else:
                label_ids.append(label[word_id])  # Each subword inherits its word's label
        labels.append(label_ids)
    
    tokenized["labels"] = labels
    return tokenized

Evaluation Metrics

Entity-Level Metrics

Predicted: [Tim Cook](PER) visited [Apple](ORG)
Gold:      [Tim Cook](PER) visited [Apple Inc.](ORG)

Precision = Correct predictions / All predictions
Recall = Correct predictions / All gold entities
F1 = 2 × (P × R) / (P + R)
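In practice the seqeval library computes these entity-level scores from tag sequences, but the definitions above reduce to set operations over (start, end, label) spans. A sketch:

```python
def entity_f1(predicted, gold):
    """Strict entity-level precision, recall, and F1 over (start, end, label) spans."""
    pred_set, gold_set = set(predicted), set(gold)
    correct = len(pred_set & gold_set)  # exact span + label matches
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# The example above: "Tim Cook" matches exactly, but the predicted "Apple"
# and gold "Apple Inc." have different boundaries, so strict matching rejects it.
pred = [(0, 2, "PER"), (3, 4, "ORG")]
gold = [(0, 2, "PER"), (3, 5, "ORG")]
print(entity_f1(pred, gold))  # (0.5, 0.5, 0.5)
```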

Strict vs Relaxed Matching

| Matching  | Boundary        | Label       |
|-----------|-----------------|-------------|
| Strict    | Exact match     | Exact match |
| Relaxed   | Partial overlap | Exact match |
| Type only | Any overlap     | Exact match |
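The matching modes differ only in how boundaries are compared. A simplified sketch that treats both relaxed variants as "any overlap" (hypothetical helper names):

```python
def spans_overlap(a, b):
    """True if two (start, end, label) spans overlap at all (end is exclusive)."""
    return a[0] < b[1] and b[0] < a[1]

def match(pred, gold, mode="strict"):
    """Compare one predicted span to one gold span under a matching mode."""
    if pred[2] != gold[2]:               # the label must match in every mode
        return False
    if mode == "strict":
        return pred[:2] == gold[:2]      # boundaries must match exactly
    return spans_overlap(pred, gold)     # relaxed: any boundary overlap counts

pred = (3, 4, "ORG")  # predicted: [Apple]
gold = (3, 5, "ORG")  # gold:      [Apple Inc.]
print(match(pred, gold, "strict"))   # False
print(match(pred, gold, "relaxed"))  # True
```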

Challenges

1. Ambiguity

"Apple" → Company or fruit?
"Jordan" → Person or country?
"Washington" → Person, state, or city?

2. Nested Entities

"Bank of America" → [Bank of [America](LOC)](ORG)

3. Domain Shift

A model trained on general news text often transfers poorly to other domains:
- Medical texts (diseases, drugs)
- Legal documents (laws, courts)
- Scientific papers (genes, proteins)

Applications

  1. Information Extraction: Extract structured data from text
  2. Question Answering: Identify entity mentions in queries
  3. Search: Entity-based indexing and retrieval
  4. Knowledge Graphs: Populate nodes from text
  5. Content Recommendation: Tag content by entities
  6. Anonymization: Redact PII (names, addresses)

Key Takeaways

  1. NER identifies and classifies named entities in text
  2. BIO tagging is the standard labeling scheme
  3. Transformer models (BERT) achieve state-of-the-art results
  4. Evaluate at entity level, not token level
  5. Domain-specific NER often requires custom training
  6. Handle ambiguity through context and fine-tuning