Of the process described above, including analysis, transfer and synthesis. At the other extreme are purely statistical systems that eschew linguistic analysis per se. Such systems rely on algorithms that can learn correspondences between words and phrases in existing translations without worrying about grammar. For example, such a system will observe, in existing French-English translated data, that most of the time when the word maison appears in French, the word house appears in English.
The system records this correspondence together with the probability that it occurs. Such systems can also learn the contextual conditions under which alternate translations should be used.
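As a rough illustration of how such correspondences and their probabilities might be learned, here is a minimal sketch. It assumes a simplified expectation-maximization loop in the style of IBM Model 1, which is one well-known approach and not necessarily what any particular system uses. The function name `train_word_probabilities` and the toy sentence pairs are made up for this example.

```python
from collections import defaultdict


def train_word_probabilities(pairs, iterations=10):
    """Estimate p(target word | source word) from sentence-aligned pairs.

    Simplified IBM Model 1 style EM: no NULL word, whitespace tokenization.
    """
    corpus = [(src.lower().split(), tgt.lower().split()) for src, tgt in pairs]

    # Start with an equal score for every word pair that co-occurs.
    prob = defaultdict(dict)
    for src_words, tgt_words in corpus:
        for s in src_words:
            for t in tgt_words:
                prob[s][t] = 1.0

    for _ in range(iterations):
        counts = defaultdict(lambda: defaultdict(float))
        # E-step: share each target word among the source words in its sentence,
        # in proportion to the current probabilities.
        for src_words, tgt_words in corpus:
            for t in tgt_words:
                norm = sum(prob[s][t] for s in src_words)
                for s in src_words:
                    counts[s][t] += prob[s][t] / norm
        # M-step: renormalize the expected counts per source word.
        for s, row in counts.items():
            total = sum(row.values())
            prob[s] = {t: c / total for t, c in row.items()}
    return prob


# Hypothetical toy data, only for illustration.
toy_pairs = [
    ("la maison", "the house"),
    ("la maison bleue", "the blue house"),
    ("la fleur", "the flower"),
]
probs = train_word_probabilities(toy_pairs)
best_for_maison = max(probs["maison"], key=probs["maison"].get)
print(best_for_maison, round(probs["maison"][best_for_maison], 3))
```

On this toy data the most probable rendering of maison comes out as house, even though "the" appears next to maison just as often. A real system would train on far more data and usually on phrases as well as single words.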
The strengths of data-driven systems are that they overcome human bias in making observations about how language will be used. The training/learning process is largely, if not wholly, automated and can save much of the time and effort of development and customization. Customization, discussed in section 4.2 below, is very important to the success of many MT deployments. The weakness of data-driven approaches is that they require significant amounts of data to learn to translate general text. One million bilingual sentence pairs has been suggested as a [...] to be very computer-resource intensive, requiring powerful processors and plenty of memory to translate in real time, as Figure 1 illustrates.
The “translation rules” learned by statistical systems consist of “parameters”: cross-lingual correspondences between words or phrases, accompanied by the probability that the word or phrase in the source language will be rendered as the [...] sentence correspondences must be established, and words separated from punctuation, or “tokenized”. It is this aligned, tokenized “parallel corpus” that a statistical system learns from.
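The preparation step mentioned above, separating words from punctuation and pairing up corresponding sentences, can be sketched as follows. This is only an illustration under stated assumptions: the sentences are taken to be already aligned one-to-one (the alignment step itself is not shown), lowercasing is a choice made here for simplicity, and the names `tokenize` and `build_parallel_corpus` are hypothetical.

```python
import re

# A token is either a run of word characters or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(sentence):
    """Split a sentence into lowercase word and punctuation tokens."""
    return TOKEN_PATTERN.findall(sentence.lower())


def build_parallel_corpus(source_sentences, target_sentences):
    """Pair already-aligned sentences and tokenize both sides."""
    if len(source_sentences) != len(target_sentences):
        raise ValueError("source and target must have the same number of sentences")
    return [
        (tokenize(src), tokenize(tgt))
        for src, tgt in zip(source_sentences, target_sentences)
    ]


# Hypothetical toy data, only for illustration.
parallel_corpus = build_parallel_corpus(
    ["La maison est bleue.", "La fleur est rouge."],
    ["The house is blue.", "The flower is red."],
)
for source_tokens, target_tokens in parallel_corpus:
    print(source_tokens, target_tokens)
```

Real tokenizers handle many more cases (abbreviations, numbers, apostrophes, languages without spaces), so this pattern is a starting point rather than a complete solution.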
Figure 1. Typically, the greater the degree of automation in system development (learning of analysis and translation rules), the shallower the analysis the system performs. In the extreme case, learning is fully automated, and the system uses no conventional grammar or lexicon.
In building rule-based systems, we noted that appropriate dictionary coverage is crucial to translation quality. In building data-driven systems, it is important that the training material be representative of the text to be translated so that the learned parameters, which are analogous to the dictionary, contain the necessary terms.
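One simple way to get a feel for whether training material is representative is to measure how many of the words in the text to be translated were seen during training. The sketch below is an assumption-laden illustration, not an established metric from the source: it uses plain whitespace tokenization, and the function name `vocabulary_coverage` and the sample sentences are invented for this example.

```python
def vocabulary_coverage(training_sentences, new_sentences):
    """Return the fraction of words in new_sentences that occur in the training data."""
    known_words = {
        word for sentence in training_sentences for word in sentence.lower().split()
    }
    new_words = [
        word for sentence in new_sentences for word in sentence.lower().split()
    ]
    if not new_words:
        return 0.0  # nothing to translate, so coverage is reported as zero
    covered = sum(1 for word in new_words if word in known_words)
    return covered / len(new_words)


# Hypothetical toy data, only for illustration.
training = ["the house is blue", "the flower is red"]
to_translate = ["the house is red", "the garden is green"]
coverage = vocabulary_coverage(training, to_translate)
print(coverage)  # 6 of the 8 words were seen in training: 0.75
```

A low value suggests that the learned parameters are likely to be missing terms the new text needs, much as an incomplete dictionary would in a rule-based system.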
2.1.4 How are languages targeted for development?
Development of translation systems so far has been an extensive, multi-year undertaking. Typically, each language pair is the work of several people, requiring at least 1 1/2 to 3 years for a commercial-quality release of a rule-based system. Automatically learned systems can be developed more quickly where training material is available, and we should see some of these emerging in the next few years. In addition to the initial cost of creating the system, maintenance and support are necessary for as long as the system is distributed to users. The high costs of development and maintenance are important factors limiting the language pairs considered for development. Because of the high initial investment required and the long time to market, there is considerable commercial risk inherent in the development of translation systems for new languages.
Source: Google
Posted by Djabarrudin syahroni at 06.32