random note
- Shreyank’s workout got selected at foss.in 2008. His workout is based on LDTP.
- I got selected as a Sarai FLOSS fellow.
- 61 people died in bomb blasts in Guwahati. Thanks to God, Nilu’s family is safe.
- Indic OCR workout not selected at foss.in in the first list.
- Made some important changes to the tesseractindic code. Matraa clipping is better now. Still, the code sucks.
- Had a 3 hour long chat (I think) with indradg on IRC.
- I am very pleased to see 2nd year people on IRC and having fun. Deboleena Varsha and Prateek shall lead the way in the future. I think the future of our LUG is in good hands.
- Ate a Cadbury temptation for the first time.
- Perhaps, 10 to 15 years from now, Shreyank, Vignesh, Nilu and I will have a company of our own.
- Compaq laptops suck big time. Never buy a Compaq.
- I owe apologies to the ocropus-bengali team. I berated their training data, but am now finding it very useful. Took a lesson from the book of practical suggestions. Building the word-dawg file from a list of 11 lakh Bengali words (thanks to the aspell Bangla team) took 6 hours.
- If my workout does not get selected at foss.in, I won’t be able to convince my professors to let me skip my exams. I want to go there with Shreyank so badly. I want to videotape his workout!
openSUSE
Ques: What makes openSUSE the best distro from a user’s perspective?
A statement of reality
Nilu: m plainly not interested
God knws how ppl get impressed by u
there’s nothing to be impressed much abt except ur tall talks
random note
- wxPython is a pain to install on CentOS 5. None of the official or 3rd party repos have wxPython, or Python 2.5 and higher. What a shame (or maybe I am wrong). Hence I am crippled. Could not use Boa Constructor.
- Rahul and Deboleena are starting to contribute to Fedora artwork.
- CodeCracker Juni is a great success.
- http://www.ilug-cal.org/wiki/index.php/FOSS_Helpline exists! I thought it was my original and unique idea. I wonder what happened to the project. I later found this http://groups.google.com/group/nrcfossconsult/browse_thread/thread/152e1a624dc0e152 .
- Today 13 of us ate at Super. The bill came to Rs. 886. People still owe Vignesh Rs. 250.
- Gandhi went home. Ankush was at WL 1 (waitlist) on the Rajdhani, so he didn’t go.
- Toppo won the chess tournament and he has promised to throw a treat tomorrow.
- Today is Rahul’s birthday.
- Bachcha got high like never before.
- I have this empty feeling in me that I really hate.
- Nilu made me listen to Carpenters. Lovely voice she has.
- College carpenters came and added another latch to the already fortified door.
- Gave 2 of my jackets for dry-clean.
- Am totally outta cash.
- lugcore mailing list is buzzing with activity.
- Was trying to guide Nilu in her quest to learn Python and wxPython. Had to switch to Windows and install Python, wxPython and Boa Constructor, but had no problems. Built an alarm clock for personal use. Installing the same stuff on Linux is such a pain… it’s a shame… really.
random note
- Python is too slow for serious image processing work.
- There is this project ocropus-bengali at http://code.google.com/p/ocropus-bengali/. I was delighted to see that they provide training data, but alas, on using this data I got crappy results. My own crappy training data is better.
Task number one for interested participants of Indic OCR workout at foss.in
Implement deskewing (basically straightening a tilted image) code in any language of choice. The algorithm may be any good standard one of your choice. The image to be tested on is this.
Then mail it to me at debayanin AT gmail DOT com, or post it on any mailing list.
I think Hough transforms would be the best way. I have been facing some difficulty implementing this in Python for the test images, but the theory is sound and should ultimately give good results.
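For anyone attempting the task, here is a minimal sketch of the Hough idea for estimating skew. It is not the code I am using: the function name estimate_skew_degrees, the parameters max_skew and step, and the input format (a binary image as a list of rows, 1 meaning ink) are all assumptions made for illustration. It only considers near-horizontal lines and returns the angle at which a single line collects the most ink pixels.

```python
import math


def estimate_skew_degrees(image, max_skew=15.0, step=0.5):
    """Estimate the skew of a binary image (list of rows, 1 = ink) in degrees.

    Hypothetical helper: votes in a Hough accumulator restricted to
    near-horizontal lines. A positive result means the text lines run
    downward to the right (image y grows downward).
    """
    points = [(x, y) for y, row in enumerate(image)
              for x, value in enumerate(row) if value]
    best_angle, best_votes = 0.0, -1
    steps = int(round(2 * max_skew / step))
    for i in range(steps + 1):
        angle = -max_skew + i * step
        theta = math.radians(angle)
        sin_t, cos_t = math.sin(theta), math.cos(theta)
        votes = {}
        for x, y in points:
            # Pixels on one line tilted by `angle` share the same offset rho.
            rho = int(round(y * cos_t - x * sin_t))
            votes[rho] = votes.get(rho, 0) + 1
        peak = max(votes.values()) if votes else 0
        if peak > best_votes:
            best_angle, best_votes = angle, peak
    return best_angle
```

On a real scan the image would first have to be binarised, and pure Python like this will be slow on full pages, so treat it as a way to check the idea on small test images.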
IOTA visit to NIT Durgapur
Finally the wait was over and we had the pleasure of hosting Prof. Shankar Sen (Chairman, IOTA), Prof. Nandini Mukherjee (Secy, IOTA) and Mr. Indranil Das Gupta aka IDG aka indradg. By the way, for those who do not know, IOTA stands for Institute for Open Technology and Applications. Our college has been selected as the seat of one of the prospective Nodal/Resource Centers to promote Linux and FOSS in West Bengal. This is an initiative of the Dept. of Information Technology, Govt. of WB, and IOTA is the interface for it.
The Meeting took place in the Senate Room of the Institute and was preceded by a working lunch at Durgapur House. The Meeting was steered by our honorable Director, Dr. S. Bhattacharya, who welcomed the distinguished guests, all the faculty members and us, the students. Prof. Sen then briefed us about IOTA, while Prof. Mukherjee told us about the three distinct plans of IOTA, which are:
Training,
Development and
Awareness
Prof. Sen’s vision is to make everyone Linux literate and, more broadly, computer literate. So he was of the opinion that we should start at the grassroots level, i.e., schools. It is the school students who, 4-5 years from now, will be in colleges like ours, and when they arrive they will already have a sound knowledge of Linux and will not face the problems that we face now. Since it is not possible to go and teach all the students in the schools, it is best to train some teachers from various schools and help them understand how the use of computers would benefit them in their curricula. The teachers could then pass the knowledge on to the students, thus fulfilling our aim. Teaching Engg students of various private Engg colleges is also a part of the plan.
The development aspect is to develop software and applications which could be used in schools and Govt offices and which would remove many of the hassles these places currently face. We need to organize seminars and workshops on QA.
The third, awareness, is about making Linux more popular among school and college students and also among white-collar workers.
The members were then divided into two groups: the senior faculty members looked into the administrative affairs, and we the students sat with IDG to get an idea of what can be done. IDG talked about the various things which could be done to promote the use of computers in schools, like using flowcharts in history and mail merge for English practicals. Then he told us about Project Gutenberg and gave us some ideas about making an application for it. He told us to focus on producing the end product and selling it. We discussed having a newsletter which would talk about the activities of the LUG. We also discussed LTSP and LDTP.
It was an inspirational talk. We have a mammoth task ahead of us and I am sure that we will not let them down.
Signing off..
Originally written by Mayank Daga at http://lug.nitdgp.ac.in/?q=node/60 .
TesseractIndic @ foss.in 2008
All my work is documented in detail on http://debayanin.googlepages.com/hackingtesseract . The latest entry is specifically for people who want to join the effort. Please go through and comment:
Note: TesseractIndic is Tesseract-OCR with Indic script support. This will remain a separate project until Tesseract-OCR actually decides to accept patches and merge Indic script support. TesseractIndic can be found here.
So let’s see where we stand. We have Tesseract-OCR, which works great for English. I managed to apply “maatraa clipping” (which is a new term/approach in the world of OCR, I think!) successfully as a proof of concept to the image being fed to the Tesseract OCR engine. Accuracy obtained by this method, along with some really crappy training, stands at about 85%.
A standard OCR process contains the following steps:
(1) Pre-processing, involving skew removal, etc. Pretty much
language-independent, though features like the shirorekha
might help here.
(2) Character extraction: Again, largely language-independent,
though language dependency might come in because of
features like shirorekha.
(3) Character identification: Language independent, maybe with
specialised plugins to take advantage of language features,
or items like known fonts.
(4) Post-processing, which involves things like spell-checking to
improve accuracy.
The currently available version of Tesseract OCR does steps 3 and 4 above for any language, but only if step 2 has been done properly, which it can’t do for connected scripts like Hindi and Bengali. So the approach is to take the scanned image, apply some pre-processing to it, and then do the “maatraa clipping” operation on it. The resulting image is then fed to the Tesseract-OCR engine.
In detail, the things to do are:
(1) Pre-processing: Skew removal, Noise removal. Skew removal in particular is key for the “maatraa clipping” code to work.
(2) “maatraa clipping”: This enables the Tesseract-OCR engine to treat Devanagari connected script like any other script.
(3) Training: Very Important for getting good results. But well documented. Good tools exist for training Tesseract-OCR.
(4) Web Interface: We need to create a web interface so people can freely OCR their documents online. No big deal.
Now my intention is to implement skew removal using Hough transforms. Hough transforms are really good at finding straight lines (among other shapes) in images. So all we need to do is find the “maatraas” and calculate their slope. That gives us the skew angle, and we just rotate the page by the opposite angle to correct the skew.
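Once a skew angle is known, the correction itself is just a rotation. The sketch below shows that step only, under stated assumptions: rotate_image and its parameters are hypothetical names, the image is a binary list of rows, and nearest-neighbour sampling is used for simplicity. In practice an imaging library's rotate function would do this job.

```python
import math


def rotate_image(image, angle_degrees, background=0):
    """Rotate a binary image (list of rows) about its centre.

    Hypothetical helper: a positive angle turns the content clockwise on
    screen (image y grows downward), so a page whose lines run downward
    to the right by `skew` degrees is levelled by rotating by -skew.
    """
    height, width = len(image), len(image[0])
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    theta = math.radians(angle_degrees)
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    result = [[background] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = x - cx, y - cy
            # Inverse mapping: find the source pixel that lands on (x, y).
            src_x = int(round(cx + dx * cos_t + dy * sin_t))
            src_y = int(round(cy - dx * sin_t + dy * cos_t))
            if 0 <= src_x < width and 0 <= src_y < height:
                result[y][x] = image[src_y][src_x]
    return result
```

Mapping each output pixel back to its source, rather than pushing source pixels forward, avoids leaving holes in the rotated image. Pixels that would come from outside the page are filled with the background value.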
I had implemented “maatraa clipping” using projection-based methods. It seems there is a digital image processing technique called “Morphological Operations” that may be a better way of doing it. Well, actually I am not that sure about it yet. Still researching and trying out stuff.
Now, I had done all this work in C++, as the Tesseract-OCR code is also in C++. But, of late, I have been mesmerised by the simplicity and power of Python and the Python Imaging Library. All the work I am doing now, including Hough transforms, is in Python. So now we have 2 options:
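To give newcomers a feel for what a projection-based approach looks like, here is a minimal sketch. It is one plausible reading, not the actual TesseractIndic code: it assumes a deskewed binary image (list of rows, 1 meaning ink), assumes text lines are separated by blank rows, and assumes that clipping means blanking the rows of each text line whose ink count is close to that line's peak. The name clip_headlines and the headline_fraction threshold are made up for illustration.

```python
def clip_headlines(image, headline_fraction=0.8):
    """Blank the densest rows of each text line in a binary image.

    Hypothetical helper: the horizontal projection (ink count per row)
    peaks on the connecting headline stroke, so rows whose count reaches
    `headline_fraction` of the line's peak are cleared. Returns a new
    image and leaves the input untouched.
    """
    profile = [sum(row) for row in image]  # horizontal projection
    clipped = [list(row) for row in image]
    height = len(image)
    y = 0
    while y < height:
        if profile[y] == 0:
            y += 1
            continue
        # Rows start..y-1 form one text line (a run of non-blank rows).
        start = y
        while y < height and profile[y] > 0:
            y += 1
        peak = max(profile[start:y])
        for r in range(start, y):
            if profile[r] >= headline_fraction * peak:
                clipped[r] = [0] * len(image[r])
    return clipped
```

This is also why skew removal matters so much here: on a tilted page the headline spreads over many rows, the projection has no sharp peak, and the threshold stops picking out the right rows.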
(1) Do the pre-processing and “maatraa clipping” in Python and feed the page to the Tesseract-OCR (will be easy and quicker to implement)
(2) Do the entire thing in C++ (will execute much faster)
We will probably end up doing both. At foss.in, I will probably bring along Python code that already works, and ask people to port it to C++ and merge it upstream into TesseractIndic. Or we could ask people to implement algorithms of their choice, in the language of their choice, on a common set of test images, and then convert that code to C++ and add it.
Stuff to do
So help me God.



