LINGUIST List 17.1298
Thu Apr 27 2006
Qs: Anyone to Trade Multilingual Dictionary Databases?
Editor for this issue: James Rider
<riderlinguistlist.org>
Directory
1. Joel Shapiro, Anyone to Trade Multilingual Dictionary Databases?
Message 1: Anyone to Trade Multilingual Dictionary Databases?
Date: 23-Apr-2006
From: Joel Shapiro <jrs_14618yahoo.com>
Subject: Anyone to Trade Multilingual Dictionary Databases?
Hello All,
I am a Windows Automated Robot Script Programmer with an interest in multi-lingual applications. I program my robot scripts using a powerful automated robot scripting tool named Macro Scheduler by Mjtnet (www.mjtnet.com).
My current project aims to let users run very effective web search engine queries in languages with which they are totally unfamiliar.
In a nutshell, users E-mail my computer (server) a search engine request: a list of search terms in their native language font or characters. My automated robot script does a dictionary word-for-word or word-for-short-term translation of the words in the E-mail request into the user's designated or 'target' language, for a search engine search in the native font or characters of the target language.
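The word-for-word lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Macro Scheduler and Excel setup the post describes: the function name `translate_terms` and the tiny English-to-Greek sample dictionary are hypothetical, and terms missing from the dictionary are simply passed through unchanged.

```python
def translate_terms(terms, dictionary):
    """Translate each term (a word or short phrase) by direct dictionary lookup.

    Terms that are not in the dictionary are kept as they were sent,
    which is an assumption of this sketch, not something the post specifies.
    """
    translated = []
    for term in terms:
        cleaned = term.strip()
        # casefold() gives a case-insensitive key for the lookup
        translated.append(dictionary.get(cleaned.casefold(), cleaned))
    return translated


# Hypothetical sample entries; a real dictionary would hold many thousands of rows.
EN_TO_EL = {
    "map": "χάρτης",
    "sea": "θάλασσα",
    "silk road": "δρόμος του μεταξιού",
}

if __name__ == "__main__":
    print(translate_terms(["Map", "silk road", "Polo"], EN_TO_EL))
```

The order of the output matches the order of the request, so the translated list can be handed straight to the search step.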
Currently the databases for my dictionaries in several languages are in (English) Excel spreadsheets because, without getting too technical, Macro Scheduler has specialized commands that make interacting with Excel a trivial proposition versus one that would otherwise be complicated. Fortunately, non-English Unicode text characters keep their attributes just fine in English Excel. Thus, using Excel as a text parsing and calculation intermediary is recommended by other Macro Scheduler programmers.
The user can also designate the target language to be the same as that of the E-mail request. In this case words/terms from the request are directly implemented in the search URL (which I will describe in more detail shortly) and no dictionary translation is required.
In the current application my automated robot E-mails the results back to the user in an attached Excel spreadsheet.
The key to the effectiveness of my multi-lingual search engine interface is the establishment of dictionaries in all languages in their respective native Unicode font or character sets, not just with respect to ''regular'' dictionary words but geographic locations and proper (i.e. people's) names as well.
The crux of my post is to ask whether anyone who has developed an application, or for that matter just extensively uses an application, in 'native' Unicode fonts or characters would be receptive to the idea of trading your word database for mine.
This could make, say, your present Russian program (Cyrillic text characters) or application truly multi-lingual/multi-national ... perhaps with a little help from an automated robot script with respect to either gleaning the words/terms or making other languages' characters applicable in your program.
I will address these topics in further detail shortly but first ...
Because Google is the current world search engine leader, it was my first choice for implementing my automated robot scripts.
Google provides and advertises an API or ''Application Programming Interface'' which essentially gives the user some robot capability for automated searches of their famed search engine. I naively figured Google had no qualms about or opposition to automated scripts interfacing with their search engine, provided the number of accesses does not exceed the limit Google sets for their API.
In other words, I figured that if the user is not directly interacting with Google's main page, whether via their API or via my robot script (which has much more in the way of custom specialized functionality and capability), it would be a ''wash''.
Wrong!
Google, in its Terms of Service verbiage, specifically prohibits automated robot activity or interaction with its services by its users unless authorized by Google.
For a few moments after I read Google's explicit prohibition it didn't make sense. But then it occurred to me that Google's main order of business is their search engine and their carnivorous assimilation of data from its users. With ''third party'' automated script robots such as mine, the explicit association between the user's search request and the user's IP address is lost ... as well as one of the crown jewels of Google's company interests that separate it from other search engine providers.
Interestingly, other search engines I've investigated appear not to have such explicit Terms of Service prohibitions as Google against automated scripts accessing them. Perhaps the others have other primary business interests and directions where the association of user and IP address is not so paramount.
Also, I found that the same concept of my ''packing'' the ''search URL'' is even easier with other search engines! Where Google requires a different search URL ''string'' for each language, as will shortly be described, other search engines have one search URL ''template'' or cookie cutter format where all the robot has to do is plug in the Unicode characters for any language in a standard search URL and ... Voila! It works!
So, while the following examples are all with respect to Google, the actual robot searches will not be using Google but other search engines. Importantly, however, the underlying concept and mechanics are all the same.
As I mentioned earlier, my Multi-Lingual Macro Scheduler automated robot search engine interface has the following format:
The user sends my computer (server) an E-mail with a list of words for a Google search in his/her preferred or native language, in the native Unicode font or character set, and designates the language for which the search engine (Google) search is to be performed.
The robot automatically scans for new E-mails and checks each for a valid request: a valid user login and password, a language that is operational, a request in a valid format so the robot can act on it, etc. Upon recognizing a valid request, the first thing the robot does is make a word-for-word or word-for-short-phrase dictionary translation of the word list.
These will be the search engine (Google) search terms ... again, in the target language's native font or character set and in the order the user lists them in the request.
The request and the target language can be the same. In this case no dictionary translation need take place and the words from the request are directly transferred ''as is'' to the search processing portion of the application.
Probably most of you reading this post are aware Google has a ''main page'' for various languages in a continuing worldwide collaborative effort. The portal to this capability is selecting the ''Language Tools'' link on the ''regular'' English Google web page: www.google.com
Interestingly, after implementing a Google search on any one of its various foreign language main pages, the result URLs contain not only search words/terms in the native font, but the URLs respectively for each language consistently maintain their format.
With Google every language has its own search URL, in which the search terms can be replaced by English ones.
For instance, the Urdu search string for the famous world traveler and explorer Marco Polo, using English characters, is:
http://www.google.com/search?hl=ur&q=Marco+Polo&btnG=%D8%AA%D9%84%D8%...
The Greek search string for Marco Polo using English characters is:
http://www.google.com/search?hl=el&q=Marco+Polo&btnG=%CE%91%CE%BD%CE%...
By default Google provides the first 10 results on the first result page. The ''next 10'' Google URLs for Urdu and Greek respectively are:
http://www.google.com/search?q=Marco+Polo&hl=ur&lr=lang_en&start=10&sa=N
http://www.google.com/search?q=Marco+Polo&hl=el&lr=lang_id&start=10&sa=N
Likewise I've found there is an equivalent of these standard ''next 10'' URLs in other search engines as well.
Once my robot has parsed the search words or terms from the E-mail request and performed a dictionary translation if required, it ''plugs in'' the terms in the search URL and deploys it, bypassing the need to interact with Google's main page for the given language or, for that matter, the main page of any search engine.
The Marco and Polo delineated by a plus '+' sign are replaced respectively with the native Unicode renditions of Marco and Polo in Urdu and Greek. Deploying the search URL with the respective native font/character renditions of Marco and Polo will yield different, often more effective results depending on the context.
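As a rough illustration of ''plugging in'' the terms, the Python sketch below assembles a search URL from a base address, a term list and a language code. The parameter names `q`, `hl` and `start` are taken from the Google URLs quoted above; the function name `build_search_url` is hypothetical, and other search engines would presumably need their own base URL and parameter names.

```python
from urllib.parse import urlencode


def build_search_url(base_url, terms, interface_lang, start=0):
    """Assemble a search URL with the terms joined by '+' signs.

    urlencode() turns spaces into '+' and percent-encodes any non-ASCII
    characters as UTF-8, so native-script terms can be passed in directly.
    """
    params = {"q": " ".join(terms), "hl": interface_lang}
    if start:
        # e.g. start=10 asks for the ''next 10'' results
        params["start"] = start
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_search_url("http://www.google.com/search", ["Marco", "Polo"], "ur"))
    print(build_search_url("http://www.google.com/search", ["Marco", "Polo"], "el", start=10))
```

The same function accepts the translated, native-script terms in place of the English ones; only the contents of the `terms` list change.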
More importantly, where just text parsing and processing is the objective, not only don't I need to interface with a search engine's main page ... I don't need to use a graphic browser such as Microsoft Internet Explorer (IE), Firefox, Netscape etc. to deploy the search URLs.
Macro Scheduler has an HTTPRequest command which gleans the text, whether it be standard ASCII English text or the Unicode text for various foreign languages, in a fraction of a second versus waiting for the graphics of a web page to stabilize in standard browsers.
For applications where pure text and no graphical (i.e. picture) aspects are involved, a Macro Scheduler solution is an order of magnitude more efficient and robust than an automated robot solution that interacts with a browser.
The results of the search URL are URLs of web pages that contain and/or pertain to the search criteria. My robot recognizes these URLs and, in a most expedited and efficient manner, again using Macro Scheduler's HTTPRequest command, requests each result URL and finds instances of the words and terms of the search request, along with their frequency, in the result pages.
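The presence/frequency step might look something like the following Python sketch. It assumes the page text has already been fetched by some other means and uses plain, case-insensitive substring counting, which is only one possible reading of ''frequency''; the function name `count_terms` is hypothetical.

```python
def count_terms(page_text, terms):
    """Count how often each search term occurs in the text of one result page.

    Matching is case-insensitive and substring-based, so a term is also
    counted when it appears inside a longer word.
    """
    haystack = page_text.casefold()
    return {term: haystack.count(term.casefold()) for term in terms}


if __name__ == "__main__":
    sample = "Marco Polo left Venice. Polo returned years later."
    print(count_terms(sample, ["Polo", "Marco", "Kublai"]))
```

A term with a count of zero is still reported, so the spreadsheet row for each result URL would have one column per requested term.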
The result URLs and presence/frequency data of the search terms are ported into an Excel spreadsheet and E-mailed back to the user as an attachment.
Macro Scheduler also has specialized commands that make scanning, receiving and sending E-mail posts trivial as well.
I hope in this post I have adequately conveyed the gist of my Multi-Lingual Automated Robot Search Engine Interface (MLARSEI). However feature-rich I can make it, it is inherently limited by the extent of the dictionaries.
Thank you for your interest and consideration.
Regards,
Joel S.
Rochester, New York
jrs_14618yahoo.com
Linguistic Field(s):
Translation