Mass Digitisation and OCR

Aly Conteh

Collections Digitisation & Strategy Programme Manager

British Library

Beowulf

1994

What challenges do we face with our mass digitisation programme?

…the amount of data we generate

…25 million pages of 19th-century books

…old standard of uncompressed

…the solution for us was to use

…using lossy (!) compression, under 40TB
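As a rough sanity check on these figures, the arithmetic below estimates the average compressed size per page. It assumes, which the slides do not state, that the "under 40TB" total covers all 25 million pages and that 1 TB means 10^12 bytes. The function name is illustrative only.

```python
# Rough upper bound on the average storage per page.
# Assumptions (not from the slides): the 40 TB total covers all
# 25 million pages, and 1 TB = 10**12 bytes (decimal units).
PAGES = 25_000_000
TOTAL_BYTES = 40 * 10**12


def average_bytes_per_page(total_bytes: int, pages: int) -> float:
    """Return the mean number of bytes stored per page."""
    if pages <= 0:
        raise ValueError("pages must be positive")
    return total_bytes / pages


avg_mb = average_bytes_per_page(TOTAL_BYTES, PAGES) / 10**6
print(f"Average page size: under {avg_mb:.1f} MB")
```

Under those assumptions the average works out to less than about 1.6 MB per page.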

…is visually lossless

[Chart: values from 50.0 to 100.0 plotted by newspaper code; series: characters, words, significant words, words with capital letter start]

…four issues that affect

PHYSICAL CHARACTERISTICS OF SOURCE MATERIAL

• Bleed through
• Stains
• Tight binding
• Holes/tears
• Creases
• Paper quality
• Inconsistent inking
• Dirt
• Stamps
• Printer errors
• Animals
• Repairs
• Lamination

The numbers

There are 152 words

61 words are incorrectly identified

Giving us 60% word accuracy

But not all words are equal
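The figures above can be reproduced with a short calculation. The sketch below also shows one possible reading of "not all words are equal": scoring only significant words. Treating "significant" as "not a stop word", the tiny stop-word list, and the assumption that the two token lists are already aligned one-to-one are all illustrative choices, not the method used in the programme.

```python
def word_accuracy(total_words: int, incorrect_words: int) -> float:
    """Fraction of words that were recognised correctly."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return (total_words - incorrect_words) / total_words


# Figures from the slide: 152 words, 61 incorrectly identified.
accuracy = word_accuracy(152, 61)
print(f"{accuracy:.1%}")  # 59.9%, which the slide rounds to 60%

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to"}


def significant_word_accuracy(truth: list[str], ocr: list[str]) -> float:
    """Accuracy over non-stop words, assuming the lists are aligned."""
    pairs = [(t, o) for t, o in zip(truth, ocr) if t.lower() not in STOP_WORDS]
    if not pairs:
        raise ValueError("no significant words to score")
    return sum(t == o for t, o in pairs) / len(pairs)


truth = ["the", "Battle", "of", "Waterloo"]
ocr = ["the", "Bottle", "of", "Waterloo"]
print(significant_word_accuracy(truth, ocr))  # 0.5
```

In the toy example, three of four words match (75%), but only one of the two significant words does, so an error in a search-relevant word weighs more heavily.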

They had the internet in 1816!

and DVD in 1803!

…Summary of issues

Geometric distortions lead to text being missed or incorrectly identified

Quality of source material has a notable impact on accuracy levels

The need to focus on significant words

False positives can be introduced by using modern lexicons
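One way to picture the last point is a lexicon that records when each word first came into use, so that OCR output containing a word newer than the publication can be flagged for review. This is a minimal sketch, not the approach taken in the project; the function name, the dictionary, and the years in it are illustrative placeholders rather than authoritative dates.

```python
# Hypothetical first-use years, for illustration only; real values
# would come from a historical lexicon.
FIRST_ATTESTED = {"internet": 1974, "dvd": 1995, "nation": 1300}


def anachronisms(tokens, publication_year, first_attested):
    """Return tokens whose lexicon entry postdates the publication year."""
    flagged = []
    for token in tokens:
        year = first_attested.get(token.lower())
        # Words missing from the lexicon are left alone, not flagged.
        if year is not None and year > publication_year:
            flagged.append(token)
    return flagged


page_1816 = ["the", "internet", "of", "the", "nation"]
print(anachronisms(page_1816, 1816, FIRST_ATTESTED))  # ['internet']
print(anachronisms(["DVD"], 1803, FIRST_ATTESTED))    # ['DVD']
```

A flagged word is only a candidate error: the check says the modern lexicon entry cannot be right for that date, not what the printed word actually was.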

How are we addressing the OCR issues?

Innovating OCR software and language technology

Sharing expertise and building capacity across Europe

Ensuring that tools and services will be sustained after the end of the project

IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.

Facts and figures

• Project supported by the European Community under the FP7 ICT Work Programme.

• Coordinated by the National Library of the Netherlands (KB)

• EU funding: € 11 500 000

• Start date: 1 January 2008

• Duration: 48 months

• From 2012: sustainable Centre of Competence

…advancement in the state of the art in image enhancement,

[Figure panels: Original image; Dewarping v.1.1]

…improvements with OCR

Building Named Entities resources (Source Material 34,500 pages)

Dictionary of National Biography

Alumni Oxonienses: the members of the University of Oxford 1500-1714

Alumni Oxonienses: the members of the University of Oxford, 1715-1886

Wilson's Mercantile Directory of Great Britain and Ireland

Cassell’s Gazetteer of Great Britain and Ireland

Thank You

http://www.impact-project.eu/
Twitter: @impactocr
http://impactocr.wordpress.com/

www.bl.uk
aly.conteh@bl.uk
Twitter: @aconteh
