Non-English to English word conversion
Requirement
Introduction: Datasets often include text in multiple languages, especially when dealing with user-generated content, international customer feedback, or multilingual documents. To standardize these datasets for analysis, it is crucial to translate non-English text into English.
h3. Requirements:
- Functionality:
#* The script should translate non-English content in a specific column (text) of the 'purgo_playground.other_language' table to English.
#* The translation should detect the source language automatically and convert it to English.
- Input Details:
#* The input column may contain text in various languages or already in English.
- Translation Logic:
#* Automatically detect the language of the input text using the GoogleTranslator class from the deep-translator library.
#* If the text is already in English, retain it without modification.
- Final Output Details:
#* A PySpark DataFrame in which every non-English entry in the 'text' column is translated into English in a 'translated' column, while entries that are already in English remain unchanged.
#* Save the output to the 'purgo_playground.translated_other_language' table in overwrite mode, keeping both the original 'text' column and the 'translated' column. Set the mergeSchema option to true.
- Technology Stack:
#* *deep-translator library*: To perform automatic language detection and translation using Google Translate.
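The translation logic above could be sketched as a small helper. This is a minimal sketch, not a definitive implementation: `to_english` and `google_translate` are hypothetical names, and it assumes that calling `GoogleTranslator` with `source="auto"` and `target="en"` returns text that is already English essentially unchanged, which should be verified on sample data. Falling back to the original text when translation fails is also an assumption; the requirement does not specify error handling.

```python
from typing import Callable, Optional


def google_translate(text: str) -> str:
    """Translate text to English, letting Google Translate detect the source language."""
    # Imported lazily so the helper below can be exercised without the library installed.
    from deep_translator import GoogleTranslator

    return GoogleTranslator(source="auto", target="en").translate(text)


def to_english(
    text: Optional[str],
    translate: Callable[[str], str] = google_translate,
) -> Optional[str]:
    """Return an English version of text, or the original text if translation is not possible."""
    # Nulls and blank strings are passed through untouched.
    if text is None or not text.strip():
        return text
    try:
        result = translate(text)
    except Exception:
        # Assumption: on any translation error (network, quota, length), keep the original value.
        return text
    # Guard against an empty result from the translator.
    return result if result else text
```

The `translate` parameter exists only so the helper can be checked with a stand-in function instead of a live call to Google Translate.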
Unity Catalog Details: ‘purgo_playground.other_language’
Prerequisites:
* Install the deep-translator library using the %pip command.
* Drop the target table 'purgo_playground.translated_other_language' if it exists, then implement the requirement.
Expected Output: Databricks PySpark code
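One possible shape for the notebook step that covers the prerequisites and the output requirements is sketched below. It is illustrative only: `build_translated_table` is a hypothetical name, `translate_fn` stands for any Python function that maps a string to its English translation, and wrapping it in a row-by-row Python UDF is a convenience assumption rather than a stated requirement. The sketch also assumes `%pip install deep-translator` has already been run in its own notebook cell.

```python
from typing import Callable, Optional


def build_translated_table(
    spark,
    translate_fn: Callable[[Optional[str]], Optional[str]],
    source_table: str = "purgo_playground.other_language",
    target_table: str = "purgo_playground.translated_other_language",
):
    """Read the source table, add a 'translated' column, and overwrite the target table."""
    # pyspark is imported lazily so this definition loads even outside a Spark runtime.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    # Wrap the plain Python function as a Spark UDF that returns a string.
    translate_udf = F.udf(translate_fn, StringType())

    # Prerequisite: drop the target table if it already exists.
    spark.sql(f"DROP TABLE IF EXISTS {target_table}")

    # Assumption: only the original 'text' column and the new 'translated' column are kept.
    result = (
        spark.table(source_table)
        .select("text")
        .withColumn("translated", translate_udf(F.col("text")))
    )

    # Overwrite the target table with the mergeSchema option set to true, as requested.
    (
        result.write.mode("overwrite")
        .option("mergeSchema", "true")
        .saveAsTable(target_table)
    )
    return result
```

In a Databricks notebook, where `spark` is already defined, this might be called as `build_translated_table(spark, my_translate_fn)`, with `my_translate_fn` being whatever translation helper the script defines.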