Training Tesseract

June 23, 2008 at 7:46 am (Uncategorized)

After many days, I uploaded some new stuff at http://code.google.com/p/tesseractindic/. The Bengali OCR is finally working! Results are not that good yet… but it IS working. I have to train it more and improve the maatraa clipping code.


Russia rocks man!!

June 21, 2008 at 9:42 pm (Uncategorized)

Russia defeated Holland by 3 goals to 1 in the quarterfinals of the Euro Cup. Russia thoroughly outplayed Holland right from the first minute. I supported Holland from day 0, and it did feel bad to see them lose so badly. But hail Russia!

Guus Hiddink is a master tactician. The raw burst of energy in the Russian frontline was too much for the ageing Holland defense. After having seen Russia’s demolition of Sweden in the last game, I was uncomfortable with the notion of Russia facing Holland, as I told Anish on the afternoon of the match.

Holland peaked too soon, and they didn’t deserve to win. I support the next best team now… RUSSIA!!


It makes me mad….

June 20, 2008 at 5:13 pm (Uncategorized)

I spent three years in NIT Durgapur. People do all sorts of things to ensure a good career. It’s great if they work hard and achieve glory… or simply achieve (as in poetry). Thanks to our new director, Dr. Swapan Bhattacharya, higher education/GRE has become very popular now. It’s for the creme-de-la-creme (or so it was supposed to be).

It has surprised me how easy it is to plagiarise or cheat your way into institutions, scholarships and fellowships. And what is more irritating is that these people go ahead and boast about it. They write bullshit project ideas… which get accepted, and then they end up not being able to do the work, but still put it in their CVs, and they even get stipends. I know people who have 10 “published” papers, and I know which papers they were copy-pasted from. When the time comes to select people to represent the institution, they will be selected, coz they did “research”, and ya, it’s hard work, but no more than a thief works hard to find a house to steal from.

And in this mad stealing frenzy, where plagiarism obfuscation seems to be the skill GRE is looking for, it’s painful to see genuine people languishing, coding, doing actual original stuff.

So if you are a new college undergrad student in India, thinking of GRE, forget intelligence… start foot-licking your professors, and learn how to plagiarise properly.

But if you are a genuine person, and plagiarism is not your style, there is another way: the open source way. Join an open source project that belongs to your area of interest. Garner knowledge and popularity in a worldwide community, and prove permanently that your word holds good. As I see it, good people have only this choice left to counter these shameless plagiarists. Hell, the day I get mad enough, I will start exposing people.

I am cool now. Blogging is great!!


gdb :)

June 19, 2008 at 6:59 am (Uncategorized)

Thanks to the GNU debugger, the segfaults have been solved, and it finally worked. I finally saw Bengali text being recognised from an image. Yesssss!!


Going nuts

June 19, 2008 at 6:01 am (Uncategorized)

I have been trying to train Tesseract for the last 3 days. I thought this would be the simplest step, coz it is very well documented. To my misery, I have been getting segfaults left, right and center, and it is driving me nuts. I hope I can get it done today, so that it is finally usable.


Holland 4-1 France

June 14, 2008 at 8:39 am (Uncategorized)

My only fear is that Holland may become complacent. Otherwise, they are going to lift the Euro Cup. Man of the Tournament = Arjen Robben.


skew~de-skew… ufffff

June 13, 2008 at 4:19 pm (Uncategorized)

I have been working on adding Indic script support to the Tesseract OCR engine for the last 2 weeks or so. I have maintained a more or less detailed log of the progress made so far at http://debayanin.googlepages.com/hackingtesseract .

On the advice of Mr. Sankarshan (who, btw, is mentoring me), I started a developer account at code.google.com. Hence I will soon start using the hosting space at http://code.google.com/p/tesseractindic/ properly, until my patches get accepted by the Tesseract maintainers.

The key to the project was the maatraa clipping code. But for maatraa clipping to work, the page image must be absolutely straight, i.e., there must be no skew. Hence I set out to write the code for finding the skew angle of a page, and then de-skewing it. I ultimately did manage something on my own, but the results are less than satisfying.

On Googling, I found that there are many research papers on how to find skew angles of scanned images, but no code. Almost all use Hough transforms. I did not have the time or the patience to understand them in detail. Will do it later though. Hence I thought up an algorithm on my own, which is not that robust, but works for most images.
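To give an idea of what skew detection involves, here is a minimal sketch of one common non-Hough approach, the projection profile method. This is not the algorithm used in the project (which is not described here), just an illustration. It assumes the page is already binarised into a list of rows of 0/1 values; the function name `estimate_skew` and its parameters `max_angle` and `step` are made up for this example. The idea: for each candidate angle, shift every ink pixel back along that slope and count the ink per row. At the right angle, the text lines collapse into a few very dense rows, so the sum of squared row counts peaks.

```python
import math


def estimate_skew(pixels, max_angle=10.0, step=0.5):
    """Estimate the skew angle (degrees) of a binary page image.

    pixels: list of rows, each a list of 0 (background) or 1 (ink).
    A positive angle means text lines run downwards to the right
    (image y grows downwards). Hypothetical helper, for illustration.
    """
    # Collect the coordinates of all ink pixels once.
    ink = [(x, y)
           for y, row in enumerate(pixels)
           for x, value in enumerate(row) if value]

    best_angle, best_score = 0.0, -1.0
    steps = int(round(2 * max_angle / step))
    for i in range(steps + 1):
        angle = -max_angle + i * step
        slope = math.tan(math.radians(angle))

        # Row histogram after undoing the candidate slope.
        profile = {}
        for x, y in ink:
            row = int(round(y - x * slope))
            profile[row] = profile.get(row, 0) + 1

        # Sharp, peaky profiles give a large sum of squares.
        score = sum(count * count for count in profile.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle
```

This brute-force search is slow on full-size scans and only as precise as `step`, so treat it as a starting point rather than something robust.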

The de-skewing part was relatively easy. I had originally imagined just the opposite though.
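For small angles, de-skewing can be approximated by a vertical shear: each column is slid up or down in proportion to its distance from the left edge. The sketch below is an assumption-laden illustration, not the project's actual code; `deskew_by_shear` is a hypothetical name, the image is again a list of rows of 0/1 values, and a positive angle means the text lines run downwards to the right. A true rotation would be more accurate for larger angles.

```python
import math


def deskew_by_shear(pixels, angle_degrees):
    """Straighten a binary image by shearing columns vertically.

    pixels: list of rows, each a list of 0 (background) or 1 (ink).
    angle_degrees: skew to undo; positive means text lines run
    downwards to the right. Hypothetical helper, for illustration.
    """
    slope = math.tan(math.radians(angle_degrees))
    height = len(pixels)
    width = len(pixels[0]) if pixels else 0
    straightened = [[0] * width for _ in range(height)]

    for y, row in enumerate(pixels):
        for x, value in enumerate(row):
            if value:
                # Move the pixel back along the skew slope.
                new_y = int(round(y - x * slope))
                # Ink that would leave the page is dropped.
                if 0 <= new_y < height:
                    straightened[new_y][x] = 1
    return straightened
```

Because pixels are moved to rounded row positions, thin strokes can pick up one-pixel jaggies; that is the usual price of a shear-based shortcut.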

Well, it was tough, but I had fun. I will have to study, understand and implement Hough transforms for this, as well as for de-italicising the image later. For now, I will live with this, assume that the images are not skewed, and proceed to train the engine with Bengali fonts. This step will finally make it usable.

Will now go ahead and update the details.


Holland beat Italy 3-0

June 10, 2008 at 5:59 am (Uncategorized)

See, I told you :)

Sorry for the terrible goof-up earlier :P . I wonder what the scoreline would be if Holland does play the Indian soccer team.


Tesseract_indic_alpha

June 8, 2008 at 6:56 pm (Uncategorized)

Released an alpha version at http://debayanin.googlepages.com/hackingtesseract. It works, but wait a little longer for a far more ready-to-use copy.


Microsoft India sinking!!

June 7, 2008 at 3:09 pm (Uncategorized)

This post is taken from http://www.svaksha.com/?p=153 . Planet FLOSS India people, ignore this.

Looks like Microsoft India has finally started paying for ignoring the sentiments of the developer community in India. Internal politics, bad recruits and poor product quality are killing it. It now has an unmanageably huge code base which its developers cannot handle, and which also cannot be open sourced because of its stupid, arrogant stance on open source.

Check this out.

