Methodology
Data & Sources — how Politick works
This page is generated from the pipeline's own metadata (coverage, models and freshness), so it cannot drift from what the pipeline actually did. Politick is built from machine OCR, translation and AI tagging of public documents; it will contain errors, and they are shown here rather than hidden.
The pipeline
1. Collect: parliament.lk Hansard PDFs, polled every 6h
2. OCR: Tesseract sin + tam + eng, 300 DPI
3. Translate: GPT-5 → unified English Markdown
4. Identify: speaker → MP roster matching
5. Enrich: AI summaries + topic tags (labelled)
6. Publish: database · search · this site
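The OCR settings above (Sinhala, Tamil and English at 300 DPI) map onto Tesseract's standard `-l` and `--dpi` command-line options. The sketch below only builds the command for one page image; it is an illustration under stated assumptions, not the pipeline's actual code. The function name, file names and the idea of rasterising each PDF page to an image beforehand are hypothetical, although Tesseract does need image input rather than a PDF.

```python
from pathlib import Path

# Settings taken from the pipeline description: three language packs, 300 DPI.
OCR_LANGS = ("sin", "tam", "eng")
OCR_DPI = 300


def tesseract_command(image: Path, out_base: Path) -> list[str]:
    """Build a Tesseract CLI call for one rasterised page (hypothetical helper).

    `image` is a page image already rendered from the Hansard PDF;
    `out_base` is the output path without extension (Tesseract adds .txt).
    """
    return [
        "tesseract",
        str(image),
        str(out_base),
        "-l", "+".join(OCR_LANGS),   # combine language packs: sin+tam+eng
        "--dpi", str(OCR_DPI),       # tell Tesseract the image resolution
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(tesseract_command(Path("page-001.png"), Path("page-001"))))
```

Passing the resulting list to `subprocess.run(..., check=True)` would execute it, provided Tesseract and the `sin`, `tam` and `eng` language data are installed.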
Coverage
10th Parliament · 21 November 2024 to 10 June 2026. 2 sitting dates published upstream are not yet transcribed. Dead-letter queue: 0 open of 0 ever recorded. Pipeline runs: 2 succeeded, last 22 June 2026.
| Stage | Done | Model |
|---|---|---|
Field coverage
The share of speeches that carry each field, generated from the data. A value of 0% means the field is not captured yet.
- Speeches matched to an MP: 64.3% (the rest are role-only / procedural attributions)
- Speeches with an AI summary: 71.5%
- Speeches with a topic tag: 71.5%
- Speeches with page/column anchors: 0% (not captured yet; the source PDF is the citable location, a known gap)
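A field-coverage figure of this kind is the number of speeches with a non-empty value for a field, divided by all speeches. The sketch below shows one way to compute it. The `Speech` record and its field names are hypothetical and are not Politick's actual schema.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Speech:
    """Hypothetical speech record; field names are illustrative only."""
    mp_id: Optional[str] = None        # set when the speaker matched the roster
    summary: Optional[str] = None      # AI summary, if generated
    topic: Optional[str] = None        # AI topic tag, if generated
    page_anchor: Optional[str] = None  # page/column anchor, if captured


def field_coverage(speeches: Sequence[Speech], field: str) -> float:
    """Return the percentage of speeches with a non-empty `field`, to 1 decimal."""
    if not speeches:
        return 0.0
    filled = sum(1 for s in speeches if getattr(s, field) not in (None, ""))
    return round(100 * filled / len(speeches), 1)


if __name__ == "__main__":
    sample = [
        Speech(mp_id="mp-1", summary="..."),
        Speech(summary="..."),   # role-only attribution: no MP match
        Speech(mp_id="mp-2"),
        Speech(),
    ]
    for name in ("mp_id", "summary", "topic", "page_anchor"):
        print(name, field_coverage(sample, name))
```

A field that is never populated, such as `page_anchor` in this sample, reports 0.0, which matches how an uncaptured field is shown on this page.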
AI use
AI does disambiguation, segmentation, summarisation and topic tagging — and nothing else: no scoring, no prediction, no interpretation. Every AI output is labelled in the UI. The models behind each stage:
| Stage | Model |
|---|---|
Accuracy & verification
| Stage | Benchmark | Result |
|---|---|---|
| OCR | vs professional ground truth (validation) | ~1.8% char error |
| Translation | vs professional translator | benchmark pending |
| Speaker matching | human-reviewed sample | benchmark pending |
The OCR figure comes from the feasibility validation. The translation and speaker-matching benchmarks are not yet published, so they are shown as pending rather than estimated.
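A character error rate is commonly computed as the edit (Levenshtein) distance between the OCR output and a reference transcription, divided by the length of the reference. This page does not say how the validation figure was calculated, so the sketch below is only the conventional definition, with hypothetical function names. It counts Unicode code points; for Sinhala and Tamil a benchmark might instead count grapheme clusters, which would give different numbers.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference characters into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def char_error_rate(ref: str, hyp: str) -> float:
    """Edit distance divided by reference length (code points)."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One dropped character in a ten-character reference.
    print(char_error_rate("parliament", "parliment"))
```

Under this definition, a rate of about 1.8% corresponds to roughly 18 character edits per 1,000 reference characters.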
Dataset register
| Dataset | Source | Coverage | Cadence | Last updated | Status |
|---|---|---|---|---|---|
| Hansard (speeches) | parliament.lk | 154 sittings · 2024-11-21 to 2026-06-10 | 6h | 20 June 2026 | live |
| MP roster + profiles | parliament.lk | 225 members | daily | 22 June 2026 | live |
| Attendance | parliament.lk house-attendance | 8 sitting-days recorded | per sitting | 17 June 2026 | live |
| Questions · Bills · Votes · Gazettes · Cabinet · Budget | — | — | — | — | planned |
Known gaps
- OCR and translation are machine processes and will contain errors; so far only the OCR error rate (~1.8% character error) has been measured.
- Page/column anchors are not yet captured — the source PDF is the citable location.
- 10,872 speeches are role-only / procedural and not matched to an MP.
- Original Sinhala/Tamil text is not yet displayed alongside the English.
- Votes/divisions, questions and committee reports are not yet extracted.
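The count of unmatched speeches in this list can be cross-checked against the 64.3% match rate under Field coverage. If 10,872 speeches are the unmatched remainder, the implied total follows from dividing by the unmatched share. The sketch below is a back-of-the-envelope check with a hypothetical helper name; the result is an estimate, not a published figure, because the percentage is rounded to one decimal place.

```python
def implied_total(unmatched: int, matched_pct: float) -> float:
    """Estimate the total number of speeches from the unmatched count.

    unmatched   -- speeches not matched to an MP (10,872 on this page)
    matched_pct -- published match rate in percent (64.3 on this page)
    """
    unmatched_share = 1 - matched_pct / 100
    if unmatched_share <= 0:
        raise ValueError("match rate must be below 100%")
    return unmatched / unmatched_share


if __name__ == "__main__":
    total = implied_total(10_872, 64.3)
    print(f"implied total speeches: about {total:,.0f}")
```

On the published figures this gives a total of roughly 30,450 speeches, give or take a few dozen from rounding.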
Corrections
Anyone can suggest a correction or comment on any record. Submissions are public, reviewed against the source, and logged.