# Named Entity Recognition

Named Entity Recognition (NER) is the task of identifying and classifying named entities (people, organizations, locations, etc.) in unstructured text.
## What are Named Entities?

Named entities are real-world objects that can be denoted with a proper name:

```
"Apple announced that Tim Cook will visit Paris next Monday."
   ↓                    ↓                  ↓         ↓
  ORG                 PERSON              LOC       DATE
```
## Common Entity Types
| Type | Description | Examples |
|---|---|---|
| PER/PERSON | People | Tim Cook, Einstein |
| ORG | Organizations | Apple, UN, MIT |
| LOC/GPE | Locations | Paris, France, Mount Everest |
| DATE | Dates/Times | Monday, 2024, next week |
| MONEY | Monetary values | $100, 50 euros |
| PERCENT | Percentages | 20%, five percent |
| PRODUCT | Products | iPhone, Windows 11 |
| EVENT | Events | World Cup, Olympics |
## Tagging Schemes

### BIO Tagging (Most Common)

- B = Beginning of entity
- I = Inside entity (continuation)
- O = Outside (not an entity)

```
"Tim Cook visited Apple headquarters"

Tim          → B-PER
Cook         → I-PER
visited      → O
Apple        → B-ORG
headquarters → O
```
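Decoding a BIO sequence back into entity spans is a common post-processing step; a minimal pure-Python sketch (`bio_to_spans` is an illustrative helper, not a library function):

```python
def bio_to_spans(tokens, tags):
    """Collect (entity_text, entity_type) pairs from a BIO-tagged sequence."""
    spans, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:                       # close any open entity first
                spans.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)             # continue the open entity
        else:                                 # "O" ends any open entity
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:                               # flush an entity at sentence end
        spans.append((" ".join(current), current_type))
    return spans

tokens = ["Tim", "Cook", "visited", "Apple", "headquarters"]
tags = ["B-PER", "I-PER", "O", "B-ORG", "O"]
print(bio_to_spans(tokens, tags))  # [('Tim Cook', 'PER'), ('Apple', 'ORG')]
```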
### BIOES Tagging

- B = Beginning
- I = Inside
- O = Outside
- E = End of entity
- S = Single-token entity

```
"New York" → B-LOC, E-LOC
"Paris"    → S-LOC
```
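BIOES tags can be derived mechanically from BIO tags by looking one tag ahead; a small sketch of the conversion (`bio_to_bioes` is an illustrative helper):

```python
def bio_to_bioes(tags):
    """Convert a BIO tag sequence to BIOES."""
    bioes = []
    for i, tag in enumerate(tags):
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        if tag.startswith("B-"):
            # single-token entity if the next tag does not continue it
            bioes.append(("B-" if nxt.startswith("I-") else "S-") + tag[2:])
        elif tag.startswith("I-"):
            # last token of the entity if the next tag does not continue it
            bioes.append(("I-" if nxt.startswith("I-") else "E-") + tag[2:])
        else:
            bioes.append(tag)
    return bioes

print(bio_to_bioes(["B-LOC", "I-LOC"]))  # ['B-LOC', 'E-LOC']  ("New York")
print(bio_to_bioes(["B-LOC"]))           # ['S-LOC']           ("Paris")
```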
## Traditional Approaches

### Rule-Based

```python
import re

patterns = {
    'EMAIL': r'\b[\w.-]+@[\w.-]+\.\w+\b',
    'PHONE': r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b',
    'DATE': r'\b\d{1,2}/\d{1,2}/\d{2,4}\b'
}

def extract_entities(text):
    """Return (text, type, span) tuples for every pattern match."""
    entities = []
    for entity_type, pattern in patterns.items():
        for match in re.finditer(pattern, text):
            entities.append((match.group(), entity_type, match.span()))
    return entities
```
### CRF (Conditional Random Fields)

Features:

- Current word
- Previous/next word
- Is capitalized?
- Word shape (Xxxx, XXXX, xxxx)
- POS tag
- Prefix/suffix
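In a toolkit such as sklearn-crfsuite these features are supplied as one dict per token; a minimal sketch (the feature names are illustrative, and the POS-tag feature is omitted since it requires a separate tagger):

```python
def word_shape(word):
    """Map characters to X/x/d: 'Tim' -> 'Xxx', 'MIT' -> 'XXX'."""
    return "".join(
        "X" if c.isupper() else "x" if c.islower() else "d" if c.isdigit() else c
        for c in word
    )

def token_features(tokens, i):
    """Feature dict for the token at position i, with context features."""
    word = tokens[i]
    return {
        "word": word.lower(),
        "is_capitalized": word[:1].isupper(),
        "shape": word_shape(word),
        "prefix3": word[:3],
        "suffix3": word[-3:],
        "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

print(token_features(["Tim", "Cook", "visited", "Apple"], 0)["shape"])  # Xxx
```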
## Deep Learning Approaches

### BiLSTM-CRF

```
Input:   Tim    Cook   visited  Apple
          ↓      ↓        ↓       ↓
Embed:  [e1]   [e2]     [e3]    [e4]
          ↓      ↓        ↓       ↓
BiLSTM: [h1]   [h2]     [h3]    [h4]
          ↓      ↓        ↓       ↓
CRF:    B-PER  I-PER     O      B-ORG
```
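The CRF layer scores a whole tag sequence as the sum of per-token emission scores (from the BiLSTM) and learned tag-to-tag transition scores, then decodes the best sequence with the Viterbi algorithm. A toy pure-Python sketch with hand-picked scores (a real implementation works on tensors in log space):

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag path.
    emissions: one {tag: score} dict per token.
    transitions: {(prev_tag, cur_tag): score}; missing pairs score 0."""
    # best score and best path ending in each tag at the first position
    scores = {t: emissions[0].get(t, 0.0) for t in tags}
    paths = {t: [t] for t in tags}
    for emit in emissions[1:]:
        new_scores, new_paths = {}, {}
        for cur in tags:
            # pick the best previous tag for the current one
            prev = max(tags, key=lambda p: scores[p] + transitions.get((p, cur), 0.0))
            new_scores[cur] = (scores[prev]
                               + transitions.get((prev, cur), 0.0)
                               + emit.get(cur, 0.0))
            new_paths[cur] = paths[prev] + [cur]
        scores, paths = new_scores, new_paths
    return paths[max(tags, key=lambda t: scores[t])]

tags = ["O", "B-PER", "I-PER"]
emissions = [{"B-PER": 2.0}, {"I-PER": 1.5, "O": 1.0}, {"O": 2.0}]
transitions = {("O", "I-PER"): -5.0, ("B-PER", "I-PER"): 1.0}  # O -> I-PER penalized
print(viterbi(emissions, transitions, tags))  # ['B-PER', 'I-PER', 'O']
```

The transition scores are what lets the CRF rule out invalid sequences such as an `I-PER` directly after `O`, which independent per-token classification cannot guarantee.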
Transformer-Based (BERT)
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
ner = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english")
text = "Tim Cook announced new products at Apple Park"
entities = ner(text)
# Output:
# [{'entity': 'B-PER', 'word': 'Tim', 'score': 0.99},
# {'entity': 'I-PER', 'word': 'Cook', 'score': 0.99},
# {'entity': 'B-ORG', 'word': 'Apple', 'score': 0.98}]
SpaCy Implementation
import spacy
nlp = spacy.load("en_core_web_lg")
text = "Apple CEO Tim Cook met with President Biden in Washington."
doc = nlp(text)
for ent in doc.ents:
print(f"{ent.text:20} {ent.label_:10} {ent.start_char}-{ent.end_char}")
# Output:
# Apple ORG 0-5
# Tim Cook PERSON 10-18
# Biden PERSON 43-48
# Washington GPE 52-62
## Training Custom NER

### Data Format (CoNLL)

One token per line with its tag; sentences are separated by blank lines:

```
Tim     B-PER
Cook    I-PER
works   O
at      O
Apple   B-ORG
.       O

She     O
lives   O
in      O
Paris   B-LOC
.       O
```
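A reader for this format only has to split on blank lines and whitespace; a minimal sketch (`read_conll` is an illustrative helper, not a library function):

```python
def read_conll(text):
    """Parse token-per-line CoNLL text into (tokens, tags) sentence pairs."""
    sentences, tokens, tags = [], [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:                    # blank line ends the current sentence
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        token, tag = line.split()
        tokens.append(token)
        tags.append(tag)
    if tokens:                          # flush a trailing sentence
        sentences.append((tokens, tags))
    return sentences

sample = ("Tim B-PER\nCook I-PER\nworks O\nat O\nApple B-ORG\n. O\n\n"
          "She O\nlives O\nin O\nParis B-LOC\n. O\n")
for toks, labs in read_conll(sample):
    print(toks, labs)
```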
### Fine-tuning BERT for NER

```python
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    TrainingArguments,
    Trainer,
    DataCollatorForTokenClassification
)

# Load pre-trained model
model_name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(label_list)  # label_list holds B-PER, I-PER, B-ORG, etc.
)

# Tokenize and align labels: sub-word tokens inherit their word's label,
# and special tokens get -100 so the loss ignores them
def tokenize_and_align_labels(examples):
    tokenized = tokenizer(
        examples["tokens"],
        truncation=True,
        is_split_into_words=True
    )
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        label_ids = []
        for word_id in word_ids:
            if word_id is None:
                label_ids.append(-100)  # special token: ignored by the loss
            else:
                label_ids.append(label[word_id])
        labels.append(label_ids)
    tokenized["labels"] = labels
    return tokenized
```
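The effect of the alignment loop can be traced on a concrete example. Here `word_ids` is a hypothetical sequence of the kind a fast tokenizer returns for `["Tim", "Cook", "visited", "Apple"]`, where "Cook" splits into two sub-tokens and `None` marks the special `[CLS]`/`[SEP]` tokens:

```python
word_ids = [None, 0, 1, 1, 2, 3, None]  # [CLS] Tim Co ##ok visited Apple [SEP]
ner_tags = [1, 2, 0, 3]                 # B-PER, I-PER, O, B-ORG as integer ids

label_ids = []
for word_id in word_ids:
    if word_id is None:
        label_ids.append(-100)          # special tokens: ignored by the loss
    else:
        label_ids.append(ner_tags[word_id])

print(label_ids)  # [-100, 1, 2, 2, 0, 3, -100]
```

Note that both sub-tokens of "Cook" receive its label; some recipes instead label only the first sub-token and mask the rest with -100.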
## Evaluation Metrics

### Entity-Level Metrics

```
Predicted: [Tim Cook](PER) visited [Apple](ORG)
Gold:      [Tim Cook](PER) visited [Apple Inc.](ORG)
```

Precision = Correct predictions / All predicted entities
Recall    = Correct predictions / All gold entities
F1        = 2 × (P × R) / (P + R)
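Under strict matching, the example above yields one correct entity out of two predicted and two gold; a minimal sketch of the computation (`entity_f1` is an illustrative helper):

```python
def entity_f1(predicted, gold):
    """Entity-level precision/recall/F1 with strict matching:
    an entity counts only if its text and label both match exactly."""
    pred_set, gold_set = set(predicted), set(gold)
    correct = len(pred_set & gold_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

predicted = [("Tim Cook", "PER"), ("Apple", "ORG")]
gold = [("Tim Cook", "PER"), ("Apple Inc.", "ORG")]
print(entity_f1(predicted, gold))  # (0.5, 0.5, 0.5)
```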
### Strict vs Relaxed Matching

| Type | Boundary | Label |
|---|---|---|
| Strict | Exact match | Exact match |
| Relaxed | Partial overlap | Exact match |
| Type only | Any overlap | Exact match |
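The difference between these modes comes down to how span boundaries are compared; a sketch over character spans, assuming the common reading in which both relaxed and type-only matching accept any overlap (`match` is an illustrative helper):

```python
def match(pred, gold, mode="strict"):
    """pred and gold are (start, end, label) character spans."""
    ps, pe, pl = pred
    gs, ge, gl = gold
    if pl != gl:                      # all modes here require the same label
        return False
    if mode == "strict":
        return (ps, pe) == (gs, ge)   # exact boundary match
    return ps < ge and gs < pe        # any overlap counts

pred = (35, 40, "ORG")   # "Apple"
gold = (35, 45, "ORG")   # "Apple Inc."
print(match(pred, gold, "strict"))   # False
print(match(pred, gold, "relaxed"))  # True
```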
## Challenges

### 1. Ambiguity

- "Apple" → company or fruit?
- "Jordan" → person or country?
- "Washington" → person, state, or city?

### 2. Nested Entities

"Bank of America" → [Bank of [America](LOC)](ORG)

### 3. Domain Shift

A general-purpose NER model trained on news text transfers poorly to:

- Medical texts (diseases, drugs)
- Legal documents (laws, courts)
- Scientific papers (genes, proteins)
## Applications
- Information Extraction: Extract structured data from text
- Question Answering: Identify entity mentions in queries
- Search: Entity-based indexing and retrieval
- Knowledge Graphs: Populate nodes from text
- Content Recommendation: Tag content by entities
- Anonymization: Redact PII (names, addresses)
## Key Takeaways
- NER identifies and classifies named entities in text
- BIO tagging is the standard labeling scheme
- Transformer models (BERT) achieve state-of-the-art results
- Evaluate at entity level, not token level
- Domain-specific NER often requires custom training
- Handle ambiguity through context and fine-tuning