Thursday, December 13, 2007

JavaScript Number Format - Decimal Precision

Original Source :: http://www.mredkj.com/javascript/nfbasic2.html



Introduction

JavaScript has built-in methods to format a number to a certain precision. They are toFixed and toPrecision, and are part of the Number object. Any browser that supports ECMAScript version 3 should support toFixed and toPrecision. This roughly equates to Netscape 6.0 and above and IE 5.5 and above.

Examples

Use toFixed to set precision after the decimal point. It doesn't matter how large the number is before the decimal point. For normal decimal formatting, this is your best option.

// Example: toFixed(2) when the number has no decimal places
// It will add trailing zeros

var num = 10;
var result = num.toFixed(2); // result will equal 10.00

// Example: toFixed(3) when the number has decimal places
// It will round to the thousandths place

num = 930.9805;
result = num.toFixed(3); // result will equal 930.981
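
One detail the examples above do not spell out is that toFixed returns a string rather than a number. The following is a minimal sketch of what that means in practice; the variable names (price, label, joined, summed) are made up for illustration.

```javascript
// toFixed returns a string, so convert it back before doing arithmetic.
var price = 10;
var label = price.toFixed(2);      // "10.00" (a string)
var joined = label + 5;            // string concatenation: "10.005"
var summed = Number(label) + 5;    // numeric addition: 15

console.log(typeof label, joined, summed);
```

In other words, format with toFixed as the last step, just before the value is displayed.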


Use toPrecision when you're setting the overall precision. Here, it matters how large the number is before and after the decimal point. This is more useful for mathematical purposes than for formatting.

// Example: toPrecision(4) when the number has 7 digits (3 before, 4 after)
// It will round to the tenths place

num = 500.2349;
result = num.toPrecision(4); // result will equal 500.2

// Example: toPrecision(4) when the number has 8 digits (4 before, 4 after)
// It will round to the ones place

num = 5000.2349;
result = num.toPrecision(4); // result will equal 5000

// Example: toPrecision(2) when the number has 5 digits (3 before, 2 after)
// It will round to the tens place expressed as an exponential

num = 555.55;
result = num.toPrecision(2); // result will equal 5.6e+2
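
Like toFixed, toPrecision returns a string, which is why the last example comes back in exponential notation. If a plain number is wanted instead, the string can be passed through Number(). This is only a sketch, and toSignificant is a hypothetical helper name, not a built-in.

```javascript
// toPrecision switches to exponential notation when the requested precision
// is smaller than the number of digits before the decimal point.
var value = 555.55;
var text = value.toPrecision(2);   // "5.6e+2"
var plain = Number(text);          // 560

// Hypothetical helper: round to `digits` significant digits, return a number.
function toSignificant(n, digits) {
  return Number(n.toPrecision(digits));
}

console.log(text, plain, toSignificant(5000.2349, 4));
```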

Floating-point errors

toFixed and toPrecision are subject to floating-point errors.

Here is a test where the starting number is 162.295. The following should show the JavaScript results:

162.29 // toFixed(2)
162.29 // toPrecision(5)

Do they show up correctly as 162.30 in your browser? Most JavaScript implementations will display 162.29.

Here is basically what happens when rounding 162.295 to two decimal places:

num = 162.295
num *= 100 // 16229.499999999998
num = Math.round(num) // 16229
num /= 100 // 162.29

As you can tell, it's in the second step that the number changes from its actual value.
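
One common workaround, sketched below, is to shift the decimal point through the number's string form instead of multiplying by 100. This is not part of the original article. roundDecimal is a hypothetical helper, and the sketch assumes a positive number whose string form is not already in exponential notation.

```javascript
// Hypothetical helper: round a positive number to `places` decimal places by
// shifting the decimal point in its string form, which avoids the inexact
// `num * 100` multiplication.
function roundDecimal(num, places) {
  var shifted = Number(num + "e" + places);            // "162.295e2" -> 16229.5
  return Number(Math.round(shifted) + "e-" + places);  // "16230e-2" -> 162.3
}

var naive = Math.round(162.295 * 100) / 100;   // 162.29
var shiftedResult = roundDecimal(162.295, 2);  // 162.3

console.log(naive, shiftedResult, shiftedResult.toFixed(2));
```

Note that Math.round rounds negative halves toward positive infinity, so negative inputs would need separate handling.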


Tuesday, December 11, 2007

[dos batch] IrfanView command line options

Original Source :: http://en.irfanview-forum.de/viewtopic.php?t=490

/one - force 'Only one instance'
/fs - force Full Screen display
/bf - force 'Fit images to desktop' display option
/title=text - set window title to 'text'
/pos=(x,y) - move the window to x,y
/convert=filename - convert input file to 'filename' and close IrfanView
(see Pattern page for additional options)
/slideshow=txtfile - play slideshow with the files from 'txtfile'
/slideshow=folder - play slideshow with the files from 'folder'

/thumbs - force thumbnails
/killmesoftly - close all IrfanView instances
/closeslideshow - close slideshow and close IrfanView after the last image
/page=X - open page number X from a multipage input image
/crop=(x,y,w,h) - crop input image: x-start, y-start, width, height
/print - print input image to default printer and close IrfanView
/print="Name" - print input image to specific printer and close IrfanView

/resize=(w,h) - resize input image to w (width) and h (height)
/resample=(w,h) - resample input image to w (width) and h (height)
/capture=X - capture the screen or window (see examples below)
/ini - use the Windows folder for INI/LST files (read/save)
/ini="Folder" - use the folder "Folder" for INI/LST files (read/save)
/clippaste - paste image from the clipboard
/clipcopy - copy image to the clipboard

/silent - don't show error messages for command line read/save errors
/invert - invert the input image
/dpi=(x,y) - change DPI values
/scan - acquire the image from the TWAIN device - show TWAIN dialog
/scanhidden - acquire the image from the TWAIN device - hide TWAIN dialog
/batchscan=(options) - simulate menu: File->Batch Scanning, see below for example
/bpp=BitsPerPixel - change color depth of the input image to BitsPerPixel

/swap_bw - swap black and white color
/gray - convert input image to grayscale
/rotate_r - rotate input image to right
/rotate_l - rotate input image to left
/filepattern="x" - browse only specific files
/sharpen=X - open image and apply the sharpen filter value X
/contrast=X - open image and apply the contrast value X
/hide=X - hide toolbar, status bar, menu and/or caption of the main window (see examples below)

/aspectratio - used for /resize and /resample, keep image proportions
/info=txtfile - write image infos to "txtfile"
/append=tiffile - append image as (TIF) page to "tiffile"
/multitif=(tif,files) - create multipage TIF from input files
/jpgq=X - set JPG save quality
/tifc=X - set TIF save compression
/wall=X - set image as wallpaper
/extract=(file,ext) - extract all pages from a multipage file
/import_pal=palfile - import and apply a special palette to the image (PAL format)

/monitor=X - start EXE-Slideshow on monitor X

[DOS batch] How do I get a filename without its extension?

@echo off
setlocal
set List=C:\Temp\*.*
for /f "delims=" %%a in ('dir /b "%List%"') do echo %%~na

Monday, December 10, 2007

[Google OCR] - Tesseract training programme

Original Source : http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract
How to use the tools provided to train Tesseract for a new language.

Introduction

Tesseract 2.0 is fully trainable. This page describes the training process, provides some guidelines on applicability to various languages, and what to expect from the results.

Background and Limitations

Tesseract was originally designed to recognize English text only. Efforts have been made to modify the engine and its training system to make them able to deal with other languages and UTF-8 characters. Tesseract 2.0 can handle any Unicode characters (coded with UTF-8), but there are limits as to the range of languages that it will be successful with, so please take this section into account before building up your hopes that it will work well on your particular language!

Tesseract can only handle left-to-right languages. While you can get something out with a right-to-left language, the output file will be ordered as if the text were left-to-right. Top-to-bottom languages will currently be hopeless.

Tesseract is unlikely to be able to handle connected scripts like Arabic. It will take some specialized algorithms to handle this case, and right now it doesn't have them.

Tesseract is likely to be so slow with large character set languages (like Chinese) that it is probably not going to be useful. There also still need to be some code changes to accommodate languages with more than 256 characters.

Any language that has different punctuation and numbers is going to be disadvantaged by some of the hard-coded algorithms that assume ASCII punctuation and digits.

Data files required

To train for another language, you have to create 8 data files in the tessdata subdirectory. The naming convention is languagecode.file_name. Language codes follow the ISO 639-3 standard. The 8 files used for English are:

  • tessdata/eng.freq-dawg
  • tessdata/eng.word-dawg
  • tessdata/eng.user-words
  • tessdata/eng.inttemp
  • tessdata/eng.normproto
  • tessdata/eng.pffmtable
  • tessdata/eng.unicharset
  • tessdata/eng.DangAmbigs

How little can you get away with?

You must create inttemp, normproto, pffmtable and unicharset using the procedure described below. If you are only trying to recognize a limited range of fonts (like a single font for instance), then a single training page might be enough. DangAmbigs and user-words may be empty files. The dictionary files freq-dawg and word-dawg don't have to be given many words if you don't have a wordlist to hand, but accuracy will be lower than if you have a decent-sized dictionary (tens of thousands of words for English, say).

Training Procedure

Some of the procedure is inevitably manual. As much automated help as possible is provided. More automated tools may appear in the future. The tools referenced below are all built in the training subdirectory.

Generate Training Images

The first step is to determine the full character set to be used, and prepare a text or word processor file containing a set of examples. The most important points to bear in mind when creating a training file are:

  • Make sure there are a minimum number of samples of each character. 10 is good, but 5 is OK for rare characters.
  • There should be more samples of the more frequent characters - at least 20.
  • Don't make the mistake of grouping all the non-letters together. Make the text more realistic. For example, The quick brown fox jumps over the lazy dog. 0123456789 !@#$%^&(),.[]{}<>/? is terrible. Much better is The (quick) brown {fox} jumps! over the $3,456.78 <lazy> #90 dog & duck/goose, as 12.5% of E-mail from aspammer@website.com is spam? This gives the textline finding code a much better chance of getting sensible baseline metrics for the special characters.
  • It is a good idea to space out the text a bit when printing, so up the inter-character and inter-line spacing in your word processor.
  • The training data currently needs to fit on a single page.
  • There is no need to train with multiple sizes. 10 point will do. (An exception to this is very small text. If you want to recognize text with an x-height smaller than about 15 pixels, you should either train it specifically or scale your images before trying to recognize them.)
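
The sample-count guidelines above (about 10 samples per character, 5 for rare ones) are easy to check mechanically before printing a training page. The following is a rough sketch, not one of the Tesseract tools; countSamples and underSampled are hypothetical names.

```javascript
// Count how often each non-whitespace character occurs in a training text.
function countSamples(text) {
  var counts = new Map();
  for (var ch of text) {              // iterates by Unicode code point
    if (/\s/.test(ch)) continue;      // spaces and newlines are not samples
    counts.set(ch, (counts.get(ch) || 0) + 1);
  }
  return counts;
}

// List the characters that appear fewer than `minimum` times.
function underSampled(text, minimum) {
  var rare = [];
  countSamples(text).forEach(function (count, ch) {
    if (count < minimum) rare.push(ch);
  });
  return rare;
}

var sample = "aaaaa bbbbb cc";
console.log(underSampled(sample, 5)); // only "c" falls below 5 samples
```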

Next, print and scan (or use some electronic rendering method) to create an image of your training page. Up to 32 training pages can be used. It is best to create pages in a mix of fonts and styles, including italic and bold.

You will also need to save your training page as a UTF-8 text file for use in the next step where you have to insert the codes into another file.

Make Box Files

For the next step below, Tesseract needs a 'box' file to go with each training image. The box file is a text file that lists the characters in the training image, in order, one per line, with the coordinates of the bounding box around the image. Tesseract 2.0 has a mode in which it will output a text file of the required format, but if the character set is different to its current training, it will naturally have the text incorrect. So the key process here is to manually edit the file to put the correct characters in it.

Run Tesseract on each of your training images using this command line:

tesseract fontfile.tif fontfile batch.nochop makebox

You then have to rename fontfile.txt to fontfile.box.

Now the hard part. You have to edit the file fontfile.box and put the UTF-8 codes for each character in the file at the start of each line, in place of the incorrect character put there by Tesseract. Example: The distribution includes an image eurotext.tif. Running the above command produces a text file that includes the following lines (lines 142-155):

s 734 491 751 516
p 753 483 776 515
r 779 492 796 516
i 799 492 810 525
n 814 492 837 516
g 839 483 862 516
t 865 491 878 520
u 101 452 122 483
b 126 453 146 486
e 149 452 168 477
r 172 453 187 476
d 211 450 232 483
e 236 450 255 474
n 259 451 281 474

Since Tesseract was run in English mode, it does not correctly recognize the umlaut. This character needs to be corrected using a suitable editor. An editor that understands UTF-8 should be used for this purpose. HTML editors are usually a good choice. (Mozilla on Linux allows you to edit UTF-8 text files directly from the browser. Firefox and IE do not let you do this. MS Word is very good at handling different text encodings, and Notepad++ is another editor that understands UTF-8.) Linux and Windows both have a character map that can be used for copying characters that cannot be typed. In this case the u needs to be changed to ü.

In theory, each line in the box file should represent one of the characters from your training file, but if you have a horizontally broken character, such as the lower double quote „ it will probably have 2 boxes that need to be merged!

Example: lines 117-130:

D 101 503 131 534
e 135 501 154 527
r 158 503 173 526
, 197 496 205 507
, 206 496 214 508
s 220 499 236 524
c 239 499 258 523
h 262 500 284 532
n 288 500 310 524
e 313 499 332 523
l 336 500 347 533
l 352 500 363 532
e 367 499 386 524
" 389 520 407 532

As you can see, the low double quote character has been expressed as two single commas. The bounding boxes must be merged as follows:

  • First number (left) take the minimum of the two lines (197)
  • Second number (bottom) take the minimum of the two lines (496)
  • Third number (right) take the maximum of the two lines (214)
  • Fourth number (top) take the maximum of the two lines (508)

This gives:

D 101 503 131 534
e 135 501 154 527
r 158 503 173 526
„ 197 496 214 508
s 220 499 236 524
c 239 499 258 523
h 262 500 284 532
n 288 500 310 524
e 313 499 332 523
l 336 500 347 533
l 352 500 363 532
e 367 499 386 524
" 389 520 407 532

If you didn't successfully space out the characters on the training image, some may have been joined into a single box. In this case, you can either remake the images with better spacing and start again, or if the pair is common, put both characters at the start of the line, leaving the bounding box to represent them both. (As of 2.01, there is a limit of 8 bytes for the description of a "character". This will allow you between 2 and 8 Unicode characters to describe the character, depending on where your codes sit in the Unicode set. If anyone hits this limit, please file an issue describing your situation.)

Note that the coordinate system used in the box file has (0,0) at the bottom-left.
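
The min/max merge rule described above can be sketched in a few lines of code. This is an illustration only, not one of the Tesseract tools; parseBoxLine, mergeBoxes and formatBoxLine are hypothetical helper names, and the sketch assumes each box line has the form "character left bottom right top".

```javascript
// Parse one box file line: "<character> <left> <bottom> <right> <top>".
function parseBoxLine(line) {
  var parts = line.trim().split(/\s+/);
  return {
    ch: parts[0],
    left: Number(parts[1]),
    bottom: Number(parts[2]),
    right: Number(parts[3]),
    top: Number(parts[4])
  };
}

// Merge two boxes into one labelled `ch`: minimum of left and bottom,
// maximum of right and top.
function mergeBoxes(a, b, ch) {
  return {
    ch: ch,
    left: Math.min(a.left, b.left),
    bottom: Math.min(a.bottom, b.bottom),
    right: Math.max(a.right, b.right),
    top: Math.max(a.top, b.top)
  };
}

// Turn a box object back into a box file line.
function formatBoxLine(box) {
  return [box.ch, box.left, box.bottom, box.right, box.top].join(" ");
}

var merged = mergeBoxes(
  parseBoxLine(", 197 496 205 507"),
  parseBoxLine(", 206 496 214 508"),
  "\u201E" // the low double quote character
);
console.log(formatBoxLine(merged)); // „ 197 496 214 508
```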

If you have an editor that understands UTF-8, this process will be a lot easier than if it doesn't, as each UTF-8 character takes up to 4 bytes to code it, and dumb editors will show you all the bytes separately.

There is a visual basic tool that you can use (windows only) to make box file creation much easier. See http://groups.google.com/group/tesseract-ocr/files and look for bbtesseract. You can also check out this thread: http://groups.google.com/group/tesseract-ocr/browse_thread/thread/2321deb561450e76/554c7a8cec11c073#554c7a8cec11c073 in the forum for more information. Thanks to unkowner for contributing this.

Bootstrapping a new character set

If you are trying to train a new character set, it is a good idea to put in the effort on a single font to get one good box file, run the rest of the training process, and then use Tesseract in your new language to make the rest of the box files as follows:

tesseract fontfile.tif fontfile -l yournewlanguage batch.nochop makebox

This should make the 2nd box file easier to make, as there is a good chance that Tesseract will recognize most of the text correctly. You can always iterate this sequence, adding more fonts to the training set (i.e. to the command line of mfTraining and cnTraining below) as you make them, but note that there is no incremental training mode that allows you to add new training data to existing sets. This means that each time you run mfTraining and cnTraining you are making new data files from scratch from the tr files you give on the command line, and these programs cannot take an existing inttemp/pffmtable/normproto and add to them directly.

New! Tif/Box pairs provided!

The Tif/Box file pairs are on the downloads page. (Note the tiff files are G4 compressed to save space, so you will have to have libtiff or uncompress them first). You could follow the following process to make better training data for your own language or subset of an existing language:

  1. Filter the box files, keeping lines for only the characters you want.
  2. Run tesseract for training (below).
  3. Cat the .tr files from multiple languages for each font to get the character set that you want.
  4. Cat the filtered box files in an identical way to the .tr files for handing off to unicharset_extractor.
  5. Run the rest of the training process.

Caution! This is not quite as simple as it sounds! cntraining and mftraining can only take up to 32 .tr files, so you must cat all the files from multiple languages for the same font together to make 32 language-combined, but font-individual files. The characters found in the tr files must match the sequence of characters found in the box files when given to unicharset_extractor, so you have to cat the box files together in the same order as the tr files. The command lines for cn/mftraining and unicharset_extractor must be given the .tr and .box files (respectively) in the same order just in case you have different filterings for the different fonts. There may be a program available to do all this and pick out the characters in the style of character map. This might make the whole thing easier.

Run Tesseract for Training

For each of your training image/box file pairs, run Tesseract in training mode:

tesseract fontfile.tif junk nobatch box.train

Note that the box filename must match the tif filename, including the path, or Tesseract won't find it. The output of this step is fontfile.tr which contains the features of each character of the training page. Note also that the output name is derived from the input image name, not the normal output name, shown here as junk. junk.txt will also be written with a single newline and no text.

Important: Check the output from apply_box (stderr on Linux, tesseract.log on Windows). If there are FATALITIES reported, then there is no point continuing with the training process until you fix the box file. A FATALITY usually indicates that this step failed to find any training samples of one of the characters listed in your box file. Either the coordinates are wrong, or there is something wrong with the image of the character concerned. If there is no workable sample of a character, it can't be recognized, and the generated inttemp file won't match the unicharset file later and Tesseract will abort.

Another error that is also fatal and needs attention is "Box file format error on line n". If preceded by "Bad utf-8 char..." then the UTF-8 codes are incorrect and need to be fixed. The error "utf-8 string too long..." indicates that you have exceeded the 8 (v2.01) byte limit on a character description. If you need a description longer than 8 bytes, please file an issue. Box file format errors without either of the above errors indicate either something wrong with the bounding box integers, or possibly a blank line in the box file. Blank lines are actually harmless, and the error can be ignored in this case. The code could skip them silently, but it reports them in case something is unintentionally wrong with the box file.

There is no need to edit the content of the fontfile.tr file. The font name inside it need not be set. For the curious, here is some information on the format:

Every character in the box file has a corresponding set of entries in
the .tr file (in order) like this:

UnknownFont <utf8 code(s)> 2
mf <number of features>
x y length dir 0 0
... (there are a set of these determined by <number of features> above)
cn 1
ypos length x2ndmoment y2ndmoment

The mf features are polygon segments of the outline normalized to the
1st and 2nd moments.
x = x position [-0.5, 0.5]
y = y position [-0.25, 0.75]
length is the length of the polygon segment [0, 1.0]
dir is the direction of the segment [0, 1.0]

The cn feature is to correct for the moment normalization to
distinguish position and size (e.g. c vs C and , vs ')

Clustering

When the character features of all the training pages have been extracted, we need to cluster them to create the prototypes. The character shape features can be clustered using the mftraining and cntraining programs:

mftraining fontfile_1.tr fontfile_2.tr ...

This will output two data files: inttemp (the shape prototypes) and pffmtable (the number of expected features for each character). (A third file called Microfeat is also written by this program, but it is not used.)

cntraining fontfile_1.tr fontfile_2.tr ...

This will output the normproto data file (the character normalization sensitivity prototypes).

Compute the Character Set

Tesseract needs to know the set of possible characters it can output. To generate the unicharset data file, use the unicharset_extractor program on the same training pages bounding box files as used for clustering:

unicharset_extractor fontfile_1.box fontfile_2.box ...

Tesseract needs to have access to character properties isalpha, isdigit, isupper, islower. This data must be encoded in the unicharset data file. Each line of this file corresponds to one character. The character in UTF-8 is followed by a hexadecimal number representing a binary mask that encodes the properties. Each bit corresponds to a property. If the bit is set to 1, it means that the property is true. The bit ordering is (from least significant bit to most significant bit): isalpha, islower, isupper, isdigit.

Example:

  • ';' is not an alphabetic character, a lower case character, an upper case character nor a digit. Its properties are thus represented by the binary number 0000 (0 in hexadecimal).
  • 'b' is an alphabetic character and a lower case character. Its properties are thus represented by the binary number 0011 (3 in hexadecimal).
  • 'W' is an alphabetic character and an upper case character. Its properties are thus represented by the binary number 0101 (5 in hexadecimal).
  • '7' is just a digit. Its properties are thus represented by the binary number 1000 (8 in hexadecimal).

; 0
b 3
W 5
7 8

If your system supports the wctype functions, these values will be set automatically by unicharset_extractor and there is no need to edit the unicharset file. On some older systems (eg Windows 95), the unicharset file must be edited by hand to add these property description codes.
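
For anyone who does have to fill in the property codes by hand, the bit layout described above can be sketched as follows. This is an illustration rather than what unicharset_extractor actually does. propertyMask is a hypothetical helper, and detecting case via toLowerCase/toUpperCase is an assumption that only approximates the C wctype functions (letters without case, for example, would get a mask of 0 here).

```javascript
// Bit values follow the ordering given above (least significant bit first).
var IS_ALPHA = 1, IS_LOWER = 2, IS_UPPER = 4, IS_DIGIT = 8;

// Hypothetical helper: compute the property mask for one character.
function propertyMask(ch) {
  var mask = 0;
  var lower = ch.toLowerCase();
  var upper = ch.toUpperCase();
  if (lower !== upper) {              // the character has a case, so treat it as a letter
    mask |= IS_ALPHA;
    if (ch === lower) mask |= IS_LOWER;
    if (ch === upper) mask |= IS_UPPER;
  }
  if (/^[0-9]$/.test(ch)) mask |= IS_DIGIT;
  return mask;
}

// Print each character with its mask in hexadecimal, one per line.
[";", "b", "W", "7"].forEach(function (ch) {
  console.log(ch + " " + propertyMask(ch).toString(16));
});
```

For the four example characters this reproduces the values 0, 3, 5 and 8 given above.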

NOTE: The unicharset file must be regenerated whenever inttemp, normproto and pffmtable are generated (i.e. they must all be recreated when the box file is changed) as they have to be in sync. The lines in unicharset must be in the correct order, as inttemp stores an index into unicharset and the actual characters returned by the classifier come from unicharset at the given index.

Dictionary Data

Tesseract uses 3 dictionary files for each language. Two of the files are coded as a Directed Acyclic Word Graph (DAWG), and the other is a plain UTF-8 text file. To make the DAWG dictionary files, you first need a wordlist for your language. The wordlist is formatted as a UTF-8 text file with one word per line. Split the wordlist into two sets: the frequent words, and the rest of the words, and then use wordlist2dawg to make the DAWG files:

wordlist2dawg frequent_words_list freq-dawg
wordlist2dawg words_list word-dawg

The third dictionary file is called user-words and is usually empty.

The last file

The final data file that Tesseract uses is called DangAmbigs. It represents the intrinsic ambiguity between characters or sets of characters, and is currently entirely manually generated. To understand the file format, look at the following example:

1       m       2       r n
3       i i i   1       m

The first field is the number of characters in the second field. The 3rd field is the number of characters in the 4th field. As with the other files, this is a UTF-8 format file, and therefore each character may be represented by multiple bytes. The first line shows that the pair 'rn' may sometimes be recognized incorrectly as 'm'. The second line shows that the character 'm' may sometimes be recognized incorrectly as the sequence 'iii'. Note that the characters on both sides should occur in unicharset. This file cannot be used to translate characters from one set to another.
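
To make the field layout concrete, here is a small sketch that reads one DangAmbigs line and checks that each count matches the characters that follow it. parseAmbiguity is a hypothetical helper, and the sketch assumes the fields and characters are separated by whitespace, as in the example.

```javascript
// Parse one DangAmbigs line: <count> <chars...> <count> <chars...>
function parseAmbiguity(line) {
  var tokens = line.trim().split(/\s+/);
  var firstCount = Number(tokens[0]);
  var first = tokens.slice(1, 1 + firstCount);
  var secondCount = Number(tokens[1 + firstCount]);
  var second = tokens.slice(2 + firstCount, 2 + firstCount + secondCount);
  if (first.length !== firstCount || second.length !== secondCount) {
    throw new Error("Count does not match characters: " + line);
  }
  return { first: first.join(""), second: second.join("") };
}

console.log(parseAmbiguity("1\tm\t2\tr n"));   // first: "m", second: "rn"
console.log(parseAmbiguity("3\ti i i\t1\tm")); // first: "iii", second: "m"
```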

The DangAmbigs file may also be empty.

Putting it all together

That is all there is to it! All you need to do now is collect together all 8 files and rename them with a lang. prefix, where lang is the 3-letter code for your language taken from http://en.wikipedia.org/wiki/List_of_ISO_639-2_codes and put them in your tessdata directory. Tesseract can then recognize text in your language (in theory) with the following:

tesseract image.tif output -l lang

(Actually, you can use any string you like for the language code, but if you want anybody else to be able to use it easily, ISO 639 is the way to go.)


Wednesday, December 5, 2007

Extracting text from pdf file

Using ghostscript...
#> ps2ascii filename.pdf convertedText.txt

Merging PDF files with gs

Merging PDF files together can be very useful.
I use it to collect all the tutorials for a course into
one file and all the solutions into another. 
This makes it easier for students to print all the
tutorials for a few weeks in one go and has other practical benefits.

There are lots of programs available on the web which merge PDFs. 
One which has been specifically brought to my attention is pdftk,
the PDF toolkit, which is free software made available under the GPL.
It is also possible to merge PDFs on the command-line 
if you have GhostScript installed.

To merge three PDF files entitled '1.pdf', '2.pdf' and '3.pdf' into one file
called 'Merged.pdf', the command is:

For Windows PCs:
gswin32 -dNOPAUSE -sDEVICE=pdfwrite -sOUTPUTFILE=Merged.pdf -dBATCH 1.pdf 2.pdf 3.pdf

For Unix PCs:
gs -dNOPAUSE -sDEVICE=pdfwrite -sOUTPUTFILE=Merged.pdf -dBATCH 1.pdf 2.pdf 3.pdf

I use two Windows batch files to work through a directory of PDFs.
They assume the existence of a first file 1.pdf and join all other
PDFs alphabetically, outputting the file merged.pdf.

The source of the two batch files is below.
Execute Merge1 in the directory where the files to be joined 
are located. These programs have not been thoroughly 
tested and are used at your own risk. 
They were thrown together but they seem to work!

Merge1:
@echo off
gswin32 -dNOPAUSE -sDEVICE=pdfwrite -sOUTPUTFILE=merged.pdf -dBATCH 1.pdf
FOR %%Z IN (*.pdf) DO IF NOT %%Z==1.pdf IF NOT %%Z==merged.pdf IF NOT %%Z==merged2.pdf call merge2.bat %%Z

Merge2:
@echo off
gswin32 -dNOPAUSE -sDEVICE=pdfwrite -sOUTPUTFILE=merged2.pdf -dBATCH merged.pdf %1
del merged.pdf
ren merged2.pdf merged.pdf


For Windows users:
If you receive the error

"'gswin32' is not recognized as an internal or external command, 
operable program or batch file",
then you need to add GhostScript to your path.
To do this, go to My Computer->Properties->Advanced tab.
Click on Environment Variables and edit the System Variable called 'Path'.
Add "c:\Program Files\gs\gs8.00\bin;" (or wherever your installation of GhostScript resides)
to the end of the path string.
