// © 2016 and later: Unicode, Inc. and others.
// License & terms of use: http://www.unicode.org/copyright.html
/*
******************************************************************************
*
*   Copyright (C) 2000-2012, International Business Machines
*   Corporation and others.  All Rights Reserved.
*
******************************************************************************
*   file name:  ushape.h
*   encoding:   UTF-8
*   tab size:   8 (not used)
*   indentation:4
*
*   created on: 2000jun29
*   created by: Markus W. Scherer
*/

#ifndef __USHAPE_H__
#define __USHAPE_H__

#include "unicode/utypes.h"

/**
 * \file
 * \brief C API:  Arabic shaping
 *
 */

/**
 * Shape Arabic text on a character basis.
 *
 * <p>This function performs basic operations for "shaping" Arabic text. It is most
 * useful with legacy data formats and legacy display technology
 * (simple terminals). All operations are performed on Unicode characters.</p>
 *
 * <p>Text-based shaping means that some character code points in the text are
 * replaced by others depending on the context. It transforms one kind of text
 * into another. In comparison, modern displays for Arabic text select
 * appropriate, context-dependent font glyphs for each text element, which means
 * that they transform text into a glyph vector.</p>
 *
 * <p>Text transformations are necessary when modern display technology is not
 * available or when text needs to be transformed to or from legacy formats that
 * use "shaped" characters. Since the Arabic script is cursive, connecting
 * adjacent letters to each other, computers select images for each letter based
 * on the surrounding letters. This usually results in four images per Arabic
 * letter: initial, middle, final, and isolated forms. In Unicode, on the other
 * hand, letters are normally stored in abstract form, and a display system is
 * expected to select the necessary glyphs. (This makes searching and other text
 * processing easier because the same letter has only one code.) It is possible
 * to mimic this with text transformations because there are characters in
 * Unicode that are rendered as letters with a specific shape
 * (or cursive connectivity). They were included for interoperability with
 * legacy systems and codepages, and for unsophisticated display systems.</p>
 *
 * <p>A second kind of text transformation is supported for Arabic digits:
 * For compatibility with legacy codepages that only include European digits,
 * it is possible to replace one set of digits by another, changing the
 * character code points. These operations can be performed for either
 * Arabic-Indic Digits (U+0660...U+0669) or Eastern (Extended) Arabic-Indic
 * digits (U+06f0...U+06f9).</p>
 *
 * <p>Some replacements may result in more or fewer characters (code points).
 * By default, this means that the destination buffer may receive text with a
 * length different from the source length. Some legacy systems rely on the
 * length of the text to be constant. They expect extra spaces to be added
 * or consumed either next to the affected character or at the end of the
 * text.</p>
 *
 * <p>For details about the available operations, see the description of the
 * <code>U_SHAPE_...</code> options.</p>
 *
 * @param source The input text.
 *
 * @param sourceLength The number of UChars in <code>source</code>.
 *
 * @param dest The destination buffer that will receive the results of the
 *             requested operations. It may be <code>NULL</code> only if
 *             <code>destSize</code> is 0. The source and destination must not
 *             overlap.
 *
 * @param destSize The size (capacity) of the destination buffer in UChars.
 *                 If <code>destSize</code> is 0, then no output is produced,
 *                 but the necessary buffer size is returned ("preflighting").
 *
 * @param options This is a 32-bit set of flags that specify the operations
 *                that are performed on the input text. If no error occurs,
 *                then the result will always be written to the destination
 *                buffer.
 *
 * @param pErrorCode must be a valid pointer to an error code value,
 *        which must not indicate a failure before the function call.
 *
 * @return The number of UChars written to the destination buffer.
 *         If an error occurred, then no output was written, or it may be
 *         incomplete. If <code>U_BUFFER_OVERFLOW_ERROR</code> is set, then
 *         the return value indicates the necessary destination buffer size.
 * @stable ICU 2.0
 */
U_CAPI int32_t U_EXPORT2
u_shapeArabic(const UChar *source, int32_t sourceLength,
              UChar *dest, int32_t destSize,
              uint32_t options,
              UErrorCode *pErrorCode);
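
/*
 * Illustration only, not ICU's implementation: the digit transformation
 * described above amounts to adding a fixed offset to each European digit.
 * The helper name replace_european_digits is hypothetical, and uint16_t
 * stands in for UChar so that the sketch builds without ICU.
 */

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of ICU): replace European digits
 * U+0030..U+0039 in place by the digits that start at `zero`, for example
 * 0x0660 (Arabic-Indic) or 0x06F0 (Eastern/Extended Arabic-Indic).
 * Returns the number of code units that were replaced. */
size_t replace_european_digits(uint16_t *text, size_t length, uint16_t zero)
{
    size_t replaced = 0;
    for (size_t i = 0; i < length; ++i) {
        if (text[i] >= 0x0030 && text[i] <= 0x0039) {
            /* Keep the digit value, move it to the target digit block. */
            text[i] = (uint16_t)(zero + (text[i] - 0x0030));
            ++replaced;
        }
    }
    return replaced;
}
```

/*
 * The length never changes here, which is why the digit options do not
 * interact with the memory (length) options below.
 */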

/**
 * Memory option: allow the result to have a different length than the source.
 * Affects: LamAlef options
 * @stable ICU 2.0
 */
#define U_SHAPE_LENGTH_GROW_SHRINK              0

/**
 * Memory option: allow the result to have a different length than the source.
 * Affects: LamAlef options
 * This option is an alias of U_SHAPE_LENGTH_GROW_SHRINK
 * @stable ICU 4.2
 */
#define U_SHAPE_LAMALEF_RESIZE                  0

/**
 * Memory option: the result must have the same length as the source.
 * If more room is necessary, then try to consume spaces next to modified characters.
 * @stable ICU 2.0
 */
#define U_SHAPE_LENGTH_FIXED_SPACES_NEAR        1

/**
 * Memory option: the result must have the same length as the source.
 * If more room is necessary, then try to consume spaces next to modified characters.
 * Affects: LamAlef options
 * This option is an alias of U_SHAPE_LENGTH_FIXED_SPACES_NEAR
 * @stable ICU 4.2
 */
#define U_SHAPE_LAMALEF_NEAR                    1

/**
 * Memory option: the result must have the same length as the source.
 * If more room is necessary, then try to consume spaces at the end of the text.
 * @stable ICU 2.0
 */
#define U_SHAPE_LENGTH_FIXED_SPACES_AT_END      2

/**
 * Memory option: the result must have the same length as the source.
 * If more room is necessary, then try to consume spaces at the end of the text.
 * Affects: LamAlef options
 * This option is an alias of U_SHAPE_LENGTH_FIXED_SPACES_AT_END
 * @stable ICU 4.2
 */
#define U_SHAPE_LAMALEF_END                     2

/**
 * Memory option: the result must have the same length as the source.
 * If more room is necessary, then try to consume spaces at the beginning of the text.
 * @stable ICU 2.0
 */
#define U_SHAPE_LENGTH_FIXED_SPACES_AT_BEGINNING 3

/**
 * Memory option: the result must have the same length as the source.
 * If more room is necessary, then try to consume spaces at the beginning of the text.
 * Affects: LamAlef options
 * This option is an alias of U_SHAPE_LENGTH_FIXED_SPACES_AT_BEGINNING
 * @stable ICU 4.2
 */
#define U_SHAPE_LAMALEF_BEGIN                    3

/**
 * Memory option: the result must have the same length as the source.
 * Shaping Mode: For each LAMALEF character found, expand LAMALEF using space at end.
 *               If there is no space at end, use spaces at beginning of the buffer. If there
 *               is no space at beginning of the buffer, use the space near the LAMALEF
 *               character (i.e. the space after it).
 *               If no space is found, the error U_NO_SPACE_AVAILABLE (as defined in utypes.h)
 *               is set in pErrorCode.
 *
 * Deshaping Mode: Behaves the same as U_SHAPE_LAMALEF_END.
 * Affects: LamAlef options
 * @stable ICU 4.2
 */
#define U_SHAPE_LAMALEF_AUTO                     0x10000

/** Bit mask for memory options. @stable ICU 2.0 */
#define U_SHAPE_LENGTH_MASK                      0x10003 /* changed from the old value 3 */


/**
 * Bit mask for LamAlef memory options.
 * @stable ICU 4.2
 */
#define U_SHAPE_LAMALEF_MASK                     0x10003 /* updated */
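
/*
 * Illustration only: the LamAlef mask 0x10003 covers bits 0-1 and bit 16, so
 * the field is not contiguous and is best read by masking and comparing rather
 * than by shifting. The helper lamalef_option_name is hypothetical. The
 * constants are local copies of the values above so the sketch builds without
 * ICU.
 */

```c
#include <stddef.h>
#include <stdint.h>

/* Local copies of the header's values; with ICU available, include
 * "unicode/ushape.h" instead. */
#ifndef U_SHAPE_LAMALEF_MASK
#define U_SHAPE_LAMALEF_RESIZE 0
#define U_SHAPE_LAMALEF_NEAR   1
#define U_SHAPE_LAMALEF_END    2
#define U_SHAPE_LAMALEF_BEGIN  3
#define U_SHAPE_LAMALEF_AUTO   0x10000
#define U_SHAPE_LAMALEF_MASK   0x10003
#endif

/* Hypothetical helper: name the LamAlef memory option selected in `options`.
 * Returns NULL when the masked value is not one of the values defined above
 * (for example the AUTO bit combined with NEAR). */
const char *lamalef_option_name(uint32_t options)
{
    switch (options & U_SHAPE_LAMALEF_MASK) {
    case U_SHAPE_LAMALEF_RESIZE: return "RESIZE";
    case U_SHAPE_LAMALEF_NEAR:   return "NEAR";
    case U_SHAPE_LAMALEF_END:    return "END";
    case U_SHAPE_LAMALEF_BEGIN:  return "BEGIN";
    case U_SHAPE_LAMALEF_AUTO:   return "AUTO";
    default:                     return NULL;
    }
}
```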

/** Direction indicator: the source is in logical (keyboard) order. @stable ICU 2.0 */
#define U_SHAPE_TEXT_DIRECTION_LOGICAL          0

/**
 * Direction indicator:
 * the source is in visual RTL order,
 * the rightmost displayed character stored first.
 * This option is an alias of U_SHAPE_TEXT_DIRECTION_LOGICAL
 * @stable ICU 4.2
 */
#define U_SHAPE_TEXT_DIRECTION_VISUAL_RTL       0

/**
 * Direction indicator:
 * the source is in visual LTR order,
 * the leftmost displayed character stored first.
 * @stable ICU 2.0
 */
#define U_SHAPE_TEXT_DIRECTION_VISUAL_LTR       4

/** Bit mask for direction indicators. @stable ICU 2.0 */
#define U_SHAPE_TEXT_DIRECTION_MASK             4


/** Letter shaping option: do not perform letter shaping. @stable ICU 2.0 */
#define U_SHAPE_LETTERS_NOOP                    0

/** Letter shaping option: replace abstract letter characters by "shaped" ones. @stable ICU 2.0 */
#define U_SHAPE_LETTERS_SHAPE                   8

/** Letter shaping option: replace "shaped" letter characters by abstract ones. @stable ICU 2.0 */
#define U_SHAPE_LETTERS_UNSHAPE                 0x10

/**
 * Letter shaping option: replace abstract letter characters by "shaped" ones.
 * The only difference from U_SHAPE_LETTERS_SHAPE is that Tashkeel letters
 * are always "shaped" into the isolated form instead of the medial form
 * (selecting code points from the Arabic Presentation Forms-B block).
 * @stable ICU 2.0
 */
#define U_SHAPE_LETTERS_SHAPE_TASHKEEL_ISOLATED 0x18


/** Bit mask for letter shaping options. @stable ICU 2.0 */
#define U_SHAPE_LETTERS_MASK                        0x18
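
/*
 * Illustration only: the letter shaping options are enumerated values inside a
 * two-bit field, not independent flags. U_SHAPE_LETTERS_SHAPE_TASHKEEL_ISOLATED
 * (0x18) has both the SHAPE (8) and UNSHAPE (0x10) bits set, so a plain bit
 * test for UNSHAPE would misread it. The two helper names are hypothetical,
 * and the constants are local copies so the sketch builds without ICU.
 */

```c
#include <stdbool.h>
#include <stdint.h>

/* Local copies of the header's values; with ICU available, include
 * "unicode/ushape.h" instead. */
#ifndef U_SHAPE_LETTERS_MASK
#define U_SHAPE_LETTERS_NOOP                    0
#define U_SHAPE_LETTERS_SHAPE                   8
#define U_SHAPE_LETTERS_UNSHAPE                 0x10
#define U_SHAPE_LETTERS_SHAPE_TASHKEEL_ISOLATED 0x18
#define U_SHAPE_LETTERS_MASK                    0x18
#endif

/* True only for U_SHAPE_LETTERS_UNSHAPE. Comparing the masked value avoids
 * the false positive that (options & U_SHAPE_LETTERS_UNSHAPE) would give
 * for U_SHAPE_LETTERS_SHAPE_TASHKEEL_ISOLATED. */
bool requests_unshaping(uint32_t options)
{
    return (options & U_SHAPE_LETTERS_MASK) == U_SHAPE_LETTERS_UNSHAPE;
}

/* True for either of the two shaping variants. */
bool requests_shaping(uint32_t options)
{
    uint32_t letters = options & U_SHAPE_LETTERS_MASK;
    return letters == U_SHAPE_LETTERS_SHAPE ||
           letters == U_SHAPE_LETTERS_SHAPE_TASHKEEL_ISOLATED;
}
```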


/** Digit shaping option: do not perform digit shaping. @stable ICU 2.0 */
#define U_SHAPE_DIGITS_NOOP                     0

/**
 * Digit shaping option:
 * Replace European digits (U+0030...) by Arabic-Indic digits.
 * @stable ICU 2.0
 */
#define U_SHAPE_DIGITS_EN2AN                    0x20

/**
 * Digit shaping option:
 * Replace Arabic-Indic digits by European digits (U+0030...).
 * @stable ICU 2.0
 */
#define U_SHAPE_DIGITS_AN2EN                    0x40

/**
 * Digit shaping option:
 * Replace European digits (U+0030...) by Arabic-Indic digits if the most recent
 * strongly directional character is an Arabic letter
 * (<code>u_charDirection()</code> result <code>U_RIGHT_TO_LEFT_ARABIC</code> [AL]).<br>
 * The direction of "preceding" depends on the direction indicator option.
 * For the first characters, the preceding strongly directional character
 * (initial state) is assumed to be not an Arabic letter
 * (it is <code>U_LEFT_TO_RIGHT</code> [L] or <code>U_RIGHT_TO_LEFT</code> [R]).
 * @stable ICU 2.0
 */
#define U_SHAPE_DIGITS_ALEN2AN_INIT_LR          0x60

/**
 * Digit shaping option:
 * Replace European digits (U+0030...) by Arabic-Indic digits if the most recent
 * strongly directional character is an Arabic letter
 * (<code>u_charDirection()</code> result <code>U_RIGHT_TO_LEFT_ARABIC</code> [AL]).<br>
 * The direction of "preceding" depends on the direction indicator option.
 * For the first characters, the preceding strongly directional character
 * (initial state) is assumed to be an Arabic letter.
 * @stable ICU 2.0
 */
#define U_SHAPE_DIGITS_ALEN2AN_INIT_AL          0x80

/** Not a valid option value. May be replaced by a new option. @stable ICU 2.0 */
#define U_SHAPE_DIGITS_RESERVED                 0xa0

/** Bit mask for digit shaping options. @stable ICU 2.0 */
#define U_SHAPE_DIGITS_MASK                     0xe0


/** Digit type option: Use Arabic-Indic digits (U+0660...U+0669). @stable ICU 2.0 */
#define U_SHAPE_DIGIT_TYPE_AN                   0

/** Digit type option: Use Eastern (Extended) Arabic-Indic digits (U+06f0...U+06f9). @stable ICU 2.0 */
#define U_SHAPE_DIGIT_TYPE_AN_EXTENDED          0x100

/** Not a valid option value. May be replaced by a new option. @stable ICU 2.0 */
#define U_SHAPE_DIGIT_TYPE_RESERVED             0x200

/** Bit mask for digit type options. @stable ICU 2.0 */
#define U_SHAPE_DIGIT_TYPE_MASK                 0x300 /* changed from 0x3f00 to 0x300 */
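
/*
 * Illustration only: reading the two digit fields from an options word. The
 * digit type selects which zero digit the replacements start from, and the
 * digit shaping option is an enumerated value in a three-bit field. Both
 * helper names are hypothetical, and the constants are local copies so the
 * sketch builds without ICU.
 */

```c
#include <stdbool.h>
#include <stdint.h>

/* Local copies of the header's values; with ICU available, include
 * "unicode/ushape.h" instead. */
#ifndef U_SHAPE_DIGITS_MASK
#define U_SHAPE_DIGITS_ALEN2AN_INIT_LR 0x60
#define U_SHAPE_DIGITS_ALEN2AN_INIT_AL 0x80
#define U_SHAPE_DIGITS_MASK            0xe0
#define U_SHAPE_DIGIT_TYPE_AN          0
#define U_SHAPE_DIGIT_TYPE_AN_EXTENDED 0x100
#define U_SHAPE_DIGIT_TYPE_MASK        0x300
#endif

/* Hypothetical helper: the code point of digit zero in the selected digit
 * set, or 0 if the digit type field holds a reserved value. */
uint16_t target_zero_digit(uint32_t options)
{
    switch (options & U_SHAPE_DIGIT_TYPE_MASK) {
    case U_SHAPE_DIGIT_TYPE_AN:          return 0x0660;
    case U_SHAPE_DIGIT_TYPE_AN_EXTENDED: return 0x06F0;
    default:                             return 0;
    }
}

/* Hypothetical helper: true if the digit option is one of the two
 * context-sensitive ALEN2AN variants. */
bool digits_are_context_sensitive(uint32_t options)
{
    uint32_t digits = options & U_SHAPE_DIGITS_MASK;
    return digits == U_SHAPE_DIGITS_ALEN2AN_INIT_LR ||
           digits == U_SHAPE_DIGITS_ALEN2AN_INIT_AL;
}
```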

/**
 * Tashkeel aggregation option:
 * Replaces any combination of U+0651 with one of
 * U+064C, U+064D, U+064E, U+064F, U+0650 with
 * U+FC5E, U+FC5F, U+FC60, U+FC61, U+FC62 respectively.
 * @stable ICU 3.6
 */
#define U_SHAPE_AGGREGATE_TASHKEEL              0x4000
/** Tashkeel aggregation option: do not aggregate tashkeels. @stable ICU 3.6 */
#define U_SHAPE_AGGREGATE_TASHKEEL_NOOP         0
/** Bit mask for tashkeel aggregation. @stable ICU 3.6 */
#define U_SHAPE_AGGREGATE_TASHKEEL_MASK         0x4000
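
/*
 * Illustration only, not ICU's implementation: the aggregation mapping listed
 * above is a fixed offset from the second mark to the ligature block. The
 * helper aggregate_shadda_pair is hypothetical, uint16_t stands in for UChar,
 * and accepting the two marks in either order is an assumption based on the
 * wording "any combination".
 */

```c
#include <stdint.h>

/* Hypothetical helper: given two adjacent code units, return the aggregated
 * ligature (U+FC5E..U+FC62) if one of them is U+0651 and the other is in
 * U+064C..U+0650; return 0 if the pair does not aggregate. */
uint16_t aggregate_shadda_pair(uint16_t first, uint16_t second)
{
    uint16_t other;
    if (first == 0x0651) {
        other = second;
    } else if (second == 0x0651) {
        other = first;
    } else {
        return 0;
    }
    if (other < 0x064C || other > 0x0650) {
        return 0;
    }
    /* U+064C maps to U+FC5E, U+064D to U+FC5F, and so on. */
    return (uint16_t)(0xFC5E + (other - 0x064C));
}
```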

/**
 * Presentation form option:
 * Don't replace Arabic Presentation Forms-A and Arabic Presentation Forms-B
 * characters with U+06xx characters, before shaping.
 * @stable ICU 3.6
 */
#define U_SHAPE_PRESERVE_PRESENTATION           0x8000
/** Presentation form option:
 * Replace Arabic Presentation Forms-A and Arabic Presentation Forms-B with
 * their unshaped correspondents in range U+06xx, before shaping.
 * @stable ICU 3.6
 */
#define U_SHAPE_PRESERVE_PRESENTATION_NOOP      0
/** Bit mask for preserve presentation form. @stable ICU 3.6 */
#define U_SHAPE_PRESERVE_PRESENTATION_MASK      0x8000

/* Seen Tail option */
/**
 * Memory option: the result must have the same length as the source.
 * Shaping mode: The SEEN family character will expand into two characters using space near
 *               the SEEN family character (i.e. the space after the character).
 *               If no space is found, the error U_NO_SPACE_AVAILABLE (as defined in utypes.h)
 *               is set in pErrorCode.
 *
 * De-shaping mode: Any Seen character followed by Tail character will be
 *                  replaced by one cell Seen and a space will replace the Tail.
 * Affects: Seen options
 * @stable ICU 4.2
 */
#define U_SHAPE_SEEN_TWOCELL_NEAR     0x200000

/**
 * Bit mask for Seen memory options.
 * @stable ICU 4.2
 */
#define U_SHAPE_SEEN_MASK             0x700000

/* YehHamza option */
/**
 * Memory option: the result must have the same length as the source.
 * Shaping mode: The YEHHAMZA character will expand into two characters using space near it
 *               (i.e. the space after the character).
 *               If no space is found, the error U_NO_SPACE_AVAILABLE (as defined in utypes.h)
 *               is set in pErrorCode.
 *
 * De-shaping mode: Any Yeh (final or isolated) character followed by Hamza character will be
 *                  replaced by one cell YehHamza and a space will replace the Hamza.
 * Affects: YehHamza options
 * @stable ICU 4.2
 */
#define U_SHAPE_YEHHAMZA_TWOCELL_NEAR      0x1000000


/**
 * Bit mask for YehHamza memory options.
 * @stable ICU 4.2
 */
#define U_SHAPE_YEHHAMZA_MASK              0x3800000

/* New Tashkeel options */
/**
 * Memory option: the result must have the same length as the source.
 * Shaping mode: Tashkeel characters will be replaced by spaces.
 *               Spaces will be placed at beginning of the buffer
 *
 * De-shaping mode: N/A
 * Affects: Tashkeel options
 * @stable ICU 4.2
 */
#define U_SHAPE_TASHKEEL_BEGIN                      0x40000

/**
 * Memory option: the result must have the same length as the source.
 * Shaping mode: Tashkeel characters will be replaced by spaces.
 *               Spaces will be placed at end of the buffer
 *
 * De-shaping mode: N/A
 * Affects: Tashkeel options
 * @stable ICU 4.2
 */
#define U_SHAPE_TASHKEEL_END                        0x60000

/**
 * Memory option: allow the result to have a different length than the source.
 * Shaping mode: Tashkeel characters will be removed, buffer length will shrink.
 * De-shaping mode: N/A
 *
 * Affects: Tashkeel options
 * @stable ICU 4.2
 */
#define U_SHAPE_TASHKEEL_RESIZE                     0x80000

/**
 * Memory option: the result must have the same length as the source.
 * Shaping mode: Tashkeel characters will be replaced by Tatweel if it is connected to adjacent
 *               characters (i.e. shaped on Tatweel) or replaced by space if it is not connected.
 *
 * De-shaping mode: N/A
 * Affects: Tashkeel options
 * @stable ICU 4.2
 */
#define U_SHAPE_TASHKEEL_REPLACE_BY_TATWEEL         0xC0000

/**
 * Bit mask for Tashkeel replacement with Space or Tatweel memory options.
 * @stable ICU 4.2
 */
#define U_SHAPE_TASHKEEL_MASK                       0xE0000
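
/*
 * Illustration only, not ICU's implementation: a fixed-length removal in the
 * spirit of U_SHAPE_TASHKEEL_BEGIN, where the freed cells become spaces at the
 * beginning of the buffer. Treating U+064B..U+0652 as the Tashkeel range is an
 * assumption of this sketch, and ICU's actual rules for which marks are
 * affected may differ. The helper names are hypothetical, and uint16_t stands
 * in for UChar.
 */

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption for this sketch: "Tashkeel" means U+064B..U+0652. */
int is_tashkeel_sketch(uint16_t c)
{
    return c >= 0x064B && c <= 0x0652;
}

/* Remove Tashkeel in place while keeping the buffer length: the remaining
 * characters keep their order and move to the end of the buffer, and the
 * freed cells at the beginning are filled with U+0020.
 * Returns the number of Tashkeel characters removed. */
size_t strip_tashkeel_pad_begin(uint16_t *text, size_t length)
{
    size_t write = length;
    /* Walk backwards, compacting the kept characters toward the end.
     * `write` never drops below `read`, so unread input is not overwritten. */
    for (size_t read = length; read-- > 0; ) {
        if (!is_tashkeel_sketch(text[read])) {
            text[--write] = text[read];
        }
    }
    /* The first `write` cells are now free. */
    for (size_t i = 0; i < write; ++i) {
        text[i] = 0x0020;
    }
    return write;
}
```

/*
 * Padding at the end instead (as U_SHAPE_TASHKEEL_END describes) would be the
 * mirror image: compact forward and fill the tail with spaces.
 */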


/* Space location Control options */
/**
 * This option affects the meaning of the BEGIN and END options. If this option is not used,
 * the default for BEGIN and END is as follows:
 * The Default (for Visual LTR, Visual RTL and Logical text)
 *           1. BEGIN always refers to the start address of physical memory.
 *           2. END always refers to the end address of physical memory.
 *
 * If this option is used, it swaps the meaning of BEGIN and END only for Visual LTR text.
 *
 * The effect on the BEGIN and END memory options is as follows:
 *    A. BEGIN for Visual LTR text: This will be the beginning (right side) of the visual text
 *       (corresponding to the physical memory address end for Visual LTR text; same as END in
 *       default behavior).
 *    B. BEGIN for Logical text: Same as BEGIN in default behavior.
 *    C. END for Visual LTR text: This will be the end (left side) of the visual text
 *       (corresponding to the physical memory address beginning for Visual LTR text; same as
 *       BEGIN in default behavior).
 *    D. END for Logical text: Same as END in default behavior.
 * Affects: All LamAlef BEGIN, END and AUTO options.
 * @stable ICU 4.2
 */
#define U_SHAPE_SPACES_RELATIVE_TO_TEXT_BEGIN_END 0x4000000

/**
 * Bit mask for swapping BEGIN and END for Visual LTR text
 * @stable ICU 4.2
 */
#define U_SHAPE_SPACES_RELATIVE_TO_TEXT_MASK      0x4000000
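
/*
 * Illustration only: the BEGIN/END rules above reduce to one condition. BEGIN
 * refers to the physical start of the buffer unless the relative-to-text flag
 * is set and the text is in visual LTR order. The helper
 * begin_is_physical_start is hypothetical, and the constants are local copies
 * so the sketch builds without ICU.
 */

```c
#include <stdbool.h>
#include <stdint.h>

/* Local copies of the header's values; with ICU available, include
 * "unicode/ushape.h" instead. */
#ifndef U_SHAPE_SPACES_RELATIVE_TO_TEXT_MASK
#define U_SHAPE_TEXT_DIRECTION_VISUAL_LTR    4
#define U_SHAPE_TEXT_DIRECTION_MASK          4
#define U_SHAPE_SPACES_RELATIVE_TO_TEXT_MASK 0x4000000
#endif

/* Hypothetical helper: true if a BEGIN option refers to the start address of
 * physical memory (and END to the end address); false if the two are swapped. */
bool begin_is_physical_start(uint32_t options)
{
    bool relative_to_text =
        (options & U_SHAPE_SPACES_RELATIVE_TO_TEXT_MASK) != 0;
    bool visual_ltr =
        (options & U_SHAPE_TEXT_DIRECTION_MASK) == U_SHAPE_TEXT_DIRECTION_VISUAL_LTR;
    /* The swap applies only when both conditions hold. */
    return !(relative_to_text && visual_ltr);
}
```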

/**
 * If this option is used, shaping will use the new Unicode code point for TAIL (i.e. 0xFE73).
 * If this option is not specified (default), the old unofficial Unicode TAIL code point is
 * used (i.e. 0x200B).
 * De-shaping does not use this option: it always searches for both the new Unicode code point
 * for the TAIL (i.e. 0xFE73) and the old unofficial Unicode TAIL code point (i.e. 0x200B) and
 * de-shapes the Seen-Family letter accordingly.
 *
 * Shaping Mode: Only shaping.
 * De-shaping Mode: N/A.
 * Affects: All Seen options
 * @stable ICU 4.8
 */
#define U_SHAPE_TAIL_NEW_UNICODE        0x8000000

/**
 * Bit mask for new Unicode Tail option
 * @stable ICU 4.8
 */
#define U_SHAPE_TAIL_TYPE_MASK          0x8000000

#endif
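
/*
 * Illustration only: the bit masks in this header can be combined to detect
 * bits in an options word that belong to no documented field. Whether ICU
 * itself rejects such bits is not stated here, so this is a caller-side sanity
 * check, and it does not catch reserved values inside a field (for example
 * U_SHAPE_DIGITS_RESERVED). The names kAllShapeMasks and
 * has_unknown_option_bits are hypothetical, and the mask values are local
 * copies so the sketch builds without ICU.
 */

```c
#include <stdbool.h>
#include <stdint.h>

/* Union of the masks defined in the header (values copied from it). */
static const uint32_t kAllShapeMasks =
    0x10003u   |  /* U_SHAPE_LENGTH_MASK, U_SHAPE_LAMALEF_MASK */
    0x4u       |  /* U_SHAPE_TEXT_DIRECTION_MASK */
    0x18u      |  /* U_SHAPE_LETTERS_MASK */
    0xe0u      |  /* U_SHAPE_DIGITS_MASK */
    0x300u     |  /* U_SHAPE_DIGIT_TYPE_MASK */
    0x4000u    |  /* U_SHAPE_AGGREGATE_TASHKEEL_MASK */
    0x8000u    |  /* U_SHAPE_PRESERVE_PRESENTATION_MASK */
    0xE0000u   |  /* U_SHAPE_TASHKEEL_MASK */
    0x700000u  |  /* U_SHAPE_SEEN_MASK */
    0x3800000u |  /* U_SHAPE_YEHHAMZA_MASK */
    0x4000000u |  /* U_SHAPE_SPACES_RELATIVE_TO_TEXT_MASK */
    0x8000000u;   /* U_SHAPE_TAIL_TYPE_MASK */

/* Hypothetical helper: true if `options` sets a bit outside every mask. */
bool has_unknown_option_bits(uint32_t options)
{
    return (options & ~kAllShapeMasks) != 0;
}
```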