LINGUIST List 17.1298
Thu Apr 27 2006
Qs: Anyone to Trade Multilingual Dictionary Databases?
Editor for this issue: James Rider
<rider linguistlist.org>
We'd like to remind readers that the responses to queries are usually best posted to the individual asking the question. That individual is then strongly encouraged to post a summary to the list. This policy was instituted to help control the huge volume of mail on LINGUIST; so we would appreciate your cooperating with it whenever it seems appropriate. In addition to posting a summary, we'd like to remind people that it is usually a good idea to personally thank those individuals who have taken the trouble to respond to the query. To post to LINGUIST, use our convenient web form at http://linguistlist.org/LL/posttolinguist.html.
Directory
1. Joel Shapiro, Anyone to Trade Multilingual Dictionary Databases?
Message 1: Anyone to Trade Multilingual Dictionary Databases?
Date: 23-Apr-2006
From: Joel Shapiro <jrs_14618 yahoo.com>
Subject: Anyone to Trade Multilingual Dictionary Databases?
Hello All,

I am a Windows automated robot script programmer with an interest in multi-lingual applications. I program my robot scripts using a powerful automated robot scripting tool named Macro Scheduler by Mjtnet (www.mjtnet.com).

My current project has the objective of enabling the user to perform very effective web search engine queries in languages with which the user may be totally unfamiliar. In a nutshell, users E-mail my computer (server) a search engine request: a list of search terms in their native language font or characters. My automated robot script does a dictionary word-for-word or word-for-short-term translation of the words in the E-mail request into the user's designated or 'target' language, for a search engine search in the native font or characters of the target language.

Currently the databases for my dictionaries in several languages are in (English) Excel spreadsheets because, without getting too technical, Macro Scheduler has specialized commands that make interacting with Excel a trivial proposition rather than a complicated one. Fortunately, non-English Unicode text characters keep their attributes just fine in English Excel. Using Excel as a text parsing and calculation intermediary is also recommended by other Macro Scheduler programmers.

The user can also designate the target language to be the same as that of the E-mail request. In this case words/terms from the request are used directly in the search URL (which I will describe in more detail shortly) and no dictionary translation is required. In the current application my automated robot E-mails the results back to the user in an attached Excel spreadsheet.

The key to the effectiveness of my multi-lingual search engine interface is the establishment of dictionaries in all languages in their respective native Unicode font or character sets, covering not just ''regular'' dictionary words but geographic locations and proper (i.e. people's) names as well.
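The word-for-word lookup described above can be illustrated with a minimal sketch. It is written in Python rather than Macro Scheduler, it uses an in-memory table in place of the Excel spreadsheets, and the function name, the tiny English-to-Greek sample dictionary, and the choice to pass unknown terms through unchanged are all assumptions made for illustration, not details of the actual robot.

```python
from typing import Dict, List

# Hypothetical sample dictionary (English -> Greek). The real dictionaries
# described in the post live in Excel spreadsheets and are far larger.
EN_EL: Dict[str, str] = {
    "water": "νερό",
    "bread": "ψωμί",
    "athens": "Αθήνα",
    "olive oil": "ελαιόλαδο",  # a short term translated as one unit
}


def translate_terms(terms: List[str], dictionary: Dict[str, str]) -> List[str]:
    """Translate each term on its own, keeping the order of the request.

    Assumption: a term missing from the dictionary is passed through
    unchanged, so proper names can still reach the search step.
    """
    translated = []
    for term in terms:
        cleaned = term.strip()
        translated.append(dictionary.get(cleaned.lower(), cleaned))
    return translated


if __name__ == "__main__":
    print(translate_terms(["Water", "olive oil", "Marco"], EN_EL))
```

When the request language and the target language are the same, this step would simply be skipped and the terms used as they arrive.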
The crux of my post is to inquire whether anyone who has developed an application, or for that matter just extensively uses an application, in 'native' Unicode fonts or characters would be receptive to the idea of trading your word database with mine. This could make, say, your present Russian (Cyrillic text characters) program or application truly multi-lingual/multi-national ... perhaps with a little help from an automated robot script, with respect to either gleaning the words/terms or making other languages' characters applicable in your program. I will address these topics in further detail shortly, but first ...

Because Google is the current world search engine leader, it was my first choice for implementing my automated robot scripts. Google provides and advertises an API or ''Application Programming Interface'', which essentially gives the user some robot capability for automated searches of their famed search engine. I naively figured Google had no qualms about automated scripts interfacing with their search engine, provided the number of accesses does not exceed the limit Google sets for their API. In other words, I figured that if the user is not directly interacting with Google's main page, whether via their API or via my robot script (which has much more in the way of custom specialized functionality and capability), it would be a ''wash''.

Wrong! Google in its Terms of Service verbiage specifically prohibits automated robot activity or interaction with its services by its users unless authorized by Google. For a few moments after I read Google's explicit prohibition, it didn't make sense. But then it occurred to me that Google's main order of business is their search engine and their carnivorous assimilation of data from its users. With ''third party'' automated script robots such as mine, the explicit association between the user's search request and the user's IP address is lost ...
as well as one of the crown jewels of Google's company interests that separate it from other search engine providers. Interestingly, other search engines I've investigated appear not to have such explicit Terms of Service prohibitions as Google against automated scripts accessing them. Perhaps the others have other primary business interests and directions where the association of user and IP address is not so paramount.

Also, I found that the same concept of my ''packing'' the ''search URL'' is even easier with other search engines! Where Google requires a different search URL ''string'' for each language, as will shortly be described, other search engines have one search URL ''template'' or cookie-cutter format where all the robot has to do is plug in the Unicode characters for any language in a standard search URL and ... Voilà! It works! So, while the following examples are all with respect to Google, the actual robot searches will not be using Google but other search engines. Importantly, however, the underlying concept and mechanics are all the same.

As I mentioned earlier, my Multi-Lingual Macro Scheduler automated robot search engine interface works as follows. The user sends my computer (server) an E-mail with a list of words for a Google search in his/her preferred or native language, in the native Unicode font or character set, and designates the language for which the search engine (Google) search is to be performed. The robot automatically scans for new E-mails. Upon recognizing a valid request (a valid user login and password, a language that is operational, a request format the robot can act on, etc.), the first thing the robot does is make a word-for-word or word-for-short-phrase dictionary translation of the word list. These will be the search engine (Google) search terms ... again, in the target language's native font or character set and in the order the user lists them in the request.
The request and the target language can be the same. In this case no dictionary translation need take place and the words from the request are directly transferred ''as is'' to the search processing portion of the application.

Probably most of you reading this post are aware that Google has a ''main page'' for various languages in a continuing worldwide collaborative effort. The portal to this capability is the ''Language Tools'' link on the ''regular'' English Google web page: www.google.com

Interestingly, after performing a Google search on any one of its various foreign language main pages, the result URLs not only contain the search words/terms in the native font, but also consistently maintain their format for each language. With Google every language has its own search URL, in which the native-language search terms can be replaced by English ones. For instance, the Urdu search string for famous world traveler and explorer Marco Polo using English characters is:

http://www.google.com/search?hl=ur&q=Marco+Polo&btnG=%D8%AA%D9%84%D8%...

The Greek search string for Marco Polo using English characters is:

http://www.google.com/search?hl=el&q=Marco+Polo&btnG=%CE%91%CE%BD%CE%...

Google provides by default the first 10 results on the first result page. The ''next 10'' Google URLs for Urdu and Greek respectively are:

http://www.google.com/search?q=Marco+Polo&hl=ur&lr=lang_en&start=10&sa=N
http://www.google.com/search?q=Marco+Polo&hl=el&lr=lang_id&start=10&sa=N

Likewise, I've found there is an equivalent of these standard ''next 10'' URLs in other search engines as well.

Once my robot has parsed the search words or terms from the E-mail request and performed a dictionary translation if required, it ''plugs in'' the terms in the search URL and deploys it, bypassing the need to interact with Google's main page for the given language or, for that matter, the main page of any search engine.
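The ''plugging in'' step can be sketched as plain string building. The example below is in Python rather than Macro Scheduler and only constructs the URL; it sends nothing. The base URL and the hl, q and start parameters are taken from the example URLs above, while the function name and the decision to omit btnG, lr and sa are assumptions made for illustration. Other search engines would need their own template.

```python
from typing import List
from urllib.parse import urlencode

# Base URL as it appears in the examples in this post.
SEARCH_BASE = "http://www.google.com/search"


def build_search_url(terms: List[str], interface_lang: str, start: int = 0) -> str:
    """Build a search URL with the terms joined by '+'.

    urlencode percent-encodes non-ASCII characters as UTF-8 bytes and turns
    the spaces between terms into '+', matching the q=Marco+Polo pattern.
    Assumption: 'start' is added only for pages after the first one.
    """
    params = [("hl", interface_lang), ("q", " ".join(terms))]
    if start > 0:
        params.append(("start", str(start)))
    return SEARCH_BASE + "?" + urlencode(params)


if __name__ == "__main__":
    # English characters, Urdu interface, first page of results.
    print(build_search_url(["Marco", "Polo"], "ur"))
    # Greek rendition of the name, Greek interface, the "next 10" results.
    print(build_search_url(["Μάρκο", "Πόλο"], "el", start=10))
```

The same function covers both cases described in the post: terms left in English characters and terms already translated into the target language's native characters.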
The Marco and Polo, delimited by a plus '+' sign, are replaced respectively with the native Unicode renditions of Marco and Polo in Urdu and Greek. Deploying the search URL with the respective native font/character renditions of Marco and Polo will yield different, often more effective results depending on the context.

More importantly, where just text parsing and processing is the objective, not only don't I need to interface with a search engine's main page ... I don't need to use a graphic browser such as Microsoft Internet Explorer (IE), Firefox, Netscape etc. to deploy the search URLs. Macro Scheduler has an HTTPRequest command which gleans the text, whether it be standard ASCII English text or the Unicode text for various foreign languages, in a fraction of a second versus waiting for the graphics of a web page to stabilize in standard browsers. For applications where pure text and no graphical (i.e. picture) aspects are involved, a Macro Scheduler solution is an order of magnitude more efficient and robust than an automated robot solution that interacts with a browser.

The results of the search URL are URLs of web pages that contain and/or pertain to the search criteria. My robot recognizes these URLs and, in a most expeditious and efficient manner, again using Macro Scheduler's HTTPRequest command, fetches each result URL and finds instances of the words and terms of the search request and their frequency in the result pages. The result URLs and presence/frequency data of the search terms are ported into an Excel spreadsheet and E-mailed back to the user as an attachment. Macro Scheduler also has specialized commands that make scanning, receiving and sending E-mail trivial as well.

I hope in this post I have adequately conveyed the gist of my Multi-Lingual Automated Robot Search Engine Interface (MLARSEI). However feature-rich I can make it, it is inherently limited by the extent of the dictionaries.
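The presence/frequency step mentioned above (counting how often each search term occurs in a fetched result page) can be sketched as follows. This is Python rather than Macro Scheduler, it works on text that has already been fetched, and the function name and the case-insensitive substring matching are assumptions made for illustration; the actual robot may count differently.

```python
from typing import Dict, List


def term_frequencies(page_text: str, terms: List[str]) -> Dict[str, int]:
    """Count non-overlapping occurrences of each term in the page text.

    Assumption: matching is a case-insensitive substring match, so a term
    is also counted when it appears inside a longer word.
    """
    haystack = page_text.casefold()
    return {term: haystack.count(term.casefold()) for term in terms}


if __name__ == "__main__":
    sample = "Marco Polo left Venice. Years later Polo returned."
    # Each row of the result spreadsheet could pair a result URL
    # with counts like these.
    print(term_frequencies(sample, ["Marco", "Polo", "Genoa"]))
```

A count of zero for a term is kept in the result, so the spreadsheet can show absence as well as presence.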
Thank you for your interest and consideration.

Regards,
Joel S.
Rochester, New York
jrs_14618 yahoo.com
Linguistic Field(s):
Translation