java.lang.Object
  pt.ul.fc.di.nlx.lxServiceClient.LXClient
public class LXClient
Client of the LXService, a web service of language technology for Portuguese.
Constructor Summary | |
---|---|
LXClient(java.lang.String username)
Creates an LXClient object. |
Method Summary | |
---|---|
java.lang.String |
chunks(java.lang.String text)
Segments into sentences and paragraphs with LX-Chunker. Marks sentence boundaries with <s>...</s> and paragraph boundaries with <p>...</p>. Unwraps sentences split over different lines. See: accuracy of LX-Chunker. |
java.lang.String |
posTags(java.lang.String text)
Segments into sentences and paragraphs with LX-Chunker and into lexemes with LX-Tokenizer, and annotates the tokens with POS tags using LX-Tagger. Assigns a single morpho-syntactic tag, from the tagset below, to every token. |
java.lang.String |
tokenizes(java.lang.String text)
Segments into sentences and paragraphs with LX-Chunker and into lexemes with LX-Tokenizer. Tokenizes text into lexically relevant tokens. |
Methods inherited from class java.lang.Object |
---|
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public LXClient(java.lang.String username)
Parameters:
username - the username required for authentication with the LXService (registered in the LXService database of clients).

Method Detail |
---|
public java.lang.String chunks(java.lang.String text) throws LXException
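Segments into sentences and paragraphs with LX-Chunker. Marks sentence boundaries with <s>...</s> and paragraph boundaries with <p>...</p>. Unwraps sentences split over different lines.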
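
The sentence and paragraph markup described here can be read back with two regular expressions. The following is a minimal sketch only: ChunkerOutput is a hypothetical helper that is not part of LXClient, the sample string is written by hand rather than taken from the service, and the nesting of <s> elements inside <p> elements is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of LXClient): reads sentences out of chunker output. */
final class ChunkerOutput {
    // Non-greedy matches; DOTALL lets a paragraph or sentence span several lines.
    private static final Pattern PARAGRAPH = Pattern.compile("<p>(.*?)</p>", Pattern.DOTALL);
    private static final Pattern SENTENCE = Pattern.compile("<s>(.*?)</s>", Pattern.DOTALL);

    /** Returns one list of sentences per paragraph, in document order. */
    static List<List<String>> paragraphs(String annotated) {
        List<List<String>> result = new ArrayList<>();
        Matcher p = PARAGRAPH.matcher(annotated);
        while (p.find()) {
            List<String> sentences = new ArrayList<>();
            // Assumption: sentences are nested inside the paragraph element.
            Matcher s = SENTENCE.matcher(p.group(1));
            while (s.find()) {
                sentences.add(s.group(1).trim());
            }
            result.add(sentences);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hand-written sample in the documented markup; not real service output.
        String sample = "<p><s>Um exemplo.</s><s>Outro exemplo.</s></p><p><s>Fim.</s></p>";
        System.out.println(paragraphs(sample));
    }
}
```

In practice the argument would be the string returned by chunks(java.lang.String).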
Parameters:
text - the text to be segmented: raw text of Portuguese (max size 10K characters).
Throws:
LXException - if an error occurs.

public java.lang.String tokenizes(java.lang.String text) throws LXException
Segments into sentences and paragraphs with LX-Chunker and into lexemes with LX-Tokenizer. Tokenizes text into lexically relevant tokens. In the examples below, the | (vertical bar) symbol is used only to make the token boundaries explicit; in the output they are simply indicated by whitespace.

um exemplo -> |um|exemplo|

Expands contractions. The first element of an expanded contraction is marked with an _ (underscore) symbol:

do -> |de_|o|

Marks spacing around punctuation or symbols. The \* and the */ symbols indicate a space to the left and a space to the right, respectively:

um, dois e três -> |um|,*/|dois|e|três|
5.3 -> |5|.|3|
1. 2 -> |1|.*/|2|
8 . 6 -> |8|\*.*/|6|

Detaches clitic pronouns from the verb. The detached pronoun is marked with a - (hyphen) symbol. When in mesoclisis, a -CL- mark is used to signal the original position of the detached clitic. Additionally, possible vocalic alterations of the verb form are marked with a # (hash) symbol:

dá-se-lho -> |dá|-se|-lhe|-o|
afirmar-se-ia -> |afirmar-CL-ia|-se|
vê-las -> |vê#|-las|

Resolves ambiguous strings. Depending on their particular occurrence, these strings can be tokenized in different ways. For instance:

deste -> |deste| when occurring as a Verb
deste -> |de|este| when occurring as a contraction (Preposition + Demonstrative)

Marks sentence boundaries with <s>...</s> and paragraph boundaries with <p>...</p>.
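
The contraction and spacing markers described above can be stripped from each token and kept as flags. This is a rough sketch under stated assumptions: TokenizerOutput and its Token class are hypothetical names that are not part of LXClient, the input is assumed to be whitespace-separated tokens with the <s> and <p> markup already removed, and the clitic markers (-, -CL-, #) are deliberately left untouched.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of LXClient): decodes tokenizer markers. */
final class TokenizerOutput {

    /** A token with its markers stripped and recorded as flags. */
    static final class Token {
        final String text;
        final boolean spaceLeft;        // token carried a leading \* mark
        final boolean spaceRight;       // token carried a trailing */ mark
        final boolean contractionHead;  // token carried a trailing _ mark

        Token(String text, boolean spaceLeft, boolean spaceRight, boolean contractionHead) {
            this.text = text;
            this.spaceLeft = spaceLeft;
            this.spaceRight = spaceRight;
            this.contractionHead = contractionHead;
        }
    }

    /** Decodes one raw token; a marker is only stripped if some text remains. */
    static Token decode(String raw) {
        String text = raw;
        boolean left = false;
        boolean right = false;
        boolean head = false;
        if (text.length() > 2 && text.startsWith("\\*")) {
            left = true;
            text = text.substring(2);
        }
        if (text.length() > 2 && text.endsWith("*/")) {
            right = true;
            text = text.substring(0, text.length() - 2);
        }
        if (text.length() > 1 && text.endsWith("_")) {
            head = true;
            text = text.substring(0, text.length() - 1);
        }
        return new Token(text, left, right, head);
    }

    /** Splits the output on whitespace and decodes every token. */
    static List<Token> decodeAll(String output) {
        List<Token> tokens = new ArrayList<>();
        for (String raw : output.trim().split("\\s+")) {
            if (!raw.isEmpty()) {
                tokens.add(decode(raw));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hand-written sample in the documented notation; not real service output.
        for (Token t : decodeAll("um ,*/ dois e de_ o")) {
            System.out.println(t.text + " left=" + t.spaceLeft
                    + " right=" + t.spaceRight + " contraction=" + t.contractionHead);
        }
    }
}
```

The length checks are a heuristic: they keep a bare _ or */ token from being reduced to an empty string, but they cannot tell a marker from a literal character in every case.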
Parameters:
text - the text to be segmented: raw text of Portuguese (max size 10K characters).
Throws:
LXException - if an error occurs.

public java.lang.String posTags(java.lang.String text) throws LXException
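Segments into sentences and paragraphs with LX-Chunker and into lexemes with LX-Tokenizer, and annotates the tokens with POS tags using LX-Tagger. Assigns a single morpho-syntactic tag, from the tagset below, to every token. The tag is attached to the token with the / (slash) symbol as separator:

um exemplo -> um/IA exemplo/CN

Each individual token in a multi-token expression gets the tag of that expression prefixed by L and followed by the number of its position within the expression:

de maneira a que -> de/LCJ1 maneira/LCJ2 a/LCJ3 que/LCJ4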
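
The token/TAG notation can be split back into pairs by cutting each item at its last slash, so that a token containing a slash of its own (such as one carrying the */ mark) keeps it. A minimal sketch, assuming whitespace-separated items; TaggedOutput and TaggedToken are hypothetical names that are not part of LXClient, and the sample string is hand-written.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of LXClient): parses token/TAG output. */
final class TaggedOutput {

    /** One token together with the tag that followed its last slash. */
    static final class TaggedToken {
        final String token;
        final String tag;

        TaggedToken(String token, String tag) {
            this.token = token;
            this.tag = tag;
        }
    }

    /** Parses whitespace-separated token/TAG items, ignoring <s> and <p> markup. */
    static List<TaggedToken> parse(String tagged) {
        List<TaggedToken> result = new ArrayList<>();
        // Replace sentence and paragraph markup with spaces before splitting.
        String plain = tagged.replaceAll("</?[sp]>", " ").trim();
        for (String item : plain.split("\\s+")) {
            int slash = item.lastIndexOf('/');
            // Skip anything that does not look like token/TAG.
            if (slash <= 0 || slash == item.length() - 1) {
                continue;
            }
            result.add(new TaggedToken(item.substring(0, slash), item.substring(slash + 1)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hand-written sample in the documented notation; not real service output.
        for (TaggedToken t : parse("<p><s>um/IA exemplo/CN</s></p>")) {
            System.out.println(t.token + " -> " + t.tag);
        }
    }
}
```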

Tokens follow the same LX-Tokenizer conventions described for tokenizes(java.lang.String): contractions are expanded and their first element is marked with _ (underscore), spacing around punctuation or symbols is marked with \* and */, clitic pronouns are detached from the verb and marked with - (hyphen), and ambiguous strings such as deste are resolved according to their particular occurrence.

Sentence boundaries are marked with <s>...</s> and paragraph boundaries with <p>...</p>.

Tag | Category | Examples |
---|---|---|
ADJ | Adjectives | bom, brilhante, eficaz, … |
ADV | Adverbs | hoje, já, sim, felizmente, … |
CARD | Cardinals | zero, dez, cem, mil, … |
CJ | Conjunctions | e, ou, tal como, … |
CL | Clitics | o, lhe, se, … |
CN | Common Nouns | computador, cidade, ideia, … |
DA | Definite Articles | o, os, … |
DEM | Demonstratives | este, esses, aquele, … |
DFR | Denominators of Fractions | meio, terço, décimo, %, … |
DGTR | Roman Numerals | VI, LX, MMIII, MCMXCIX, … |
DGT | Digits | 0, 1, 42, 12345, 67890, … |
DM | Discourse Marker | olá, … |
EADR | Electronic Addresses | http://www.di.fc.ul.pt, … |
EOE | End of Enumeration | etc |
EXC | Exclamative | ah, ei, etc. |
GER | Gerunds | sendo, afirmando, vivendo, … |
GERAUX | Gerund "ter"/"haver" in compound tenses | tendo, havendo, … |
IA | Indefinite Articles | uns, umas, … |
IND | Indefinites | tudo, alguém, ninguém, … |
INF | Infinitive | ser, afirmar, viver, … |
INFAUX | Infinitive "ter"/"haver" in compound tenses | ter, haver, … |
INT | Interrogatives | quem, como, quando, … |
ITJ | Interjection | bolas, caramba, … |
LTR | Letters | a, b, c, … |
MGT | Magnitude Classes | unidade, dezena, dúzia, resma, … |
MTH | Months | Janeiro, Dezembro, … |
NP | Noun Phrases | idem, … |
ORD | Ordinals | primeiro, centésimo, penúltimo, … |
PADR | Part of Address | Rua, av., rot., … |
PNM | Part of Name | Lisboa, António, João, … |
PNT | Punctuation Marks | ., ?, (, … |
POSS | Possessives | meu, teu, seu, … |
PPA | Past Participles not in compound tenses | sido, afirmados, vivida, … |
PP | Prepositional Phrases | algures, … |
PPT | Past Participle in compound tenses | sido, afirmado, vivido, … |
PREP | Prepositions | de, para, em redor de, … |
PRS | Personals | eu, tu, ele, … |
QNT | Quantifiers | todos, muitos, nenhum, … |
REL | Relatives | que, cujo, tal que, … |
STT | Social Titles | Presidente, drª., prof., … |
SYB | Symbols | @, #, &, … |
TERMN | Optional Terminations | (s), (as), … |
UM | "um" or "uma" | um, uma |
UNIT | Abbreviated Measurement Units | kg., km., etc. |
VAUX | Finite "ter" or "haver" in compound tenses | temos, haveriam, … |
V | Verbs (other than PPA, PPT, INF or GER) | falou, falaria, … |
WD | Week Days | segunda, terça-feira, sábado, … |
Multi-Word Expressions | ||
LADV1…LADVn | Multi-Word Adverbs | de facto, em suma, um pouco, … |
LCJ1…LCJn | Multi-Word Conjunctions | assim como, já que, … |
LDEM1…LDEMn | Multi-Word Demonstratives | o mesmo, … |
LDFR1…LDFRn | Multi-Word Denominators of Fractions | por cento |
LDM1…LDMn | Multi-Word Discourse Markers | pois não, até logo, … |
LITJ1…LITJn | Multi-Word Interjections | meu Deus |
LPRS1…LPRSn | Multi-Word Personals | a gente, si mesmo, V. Exa., … |
LPREP1…LPREPn | Multi-Word Prepositions | através de, a partir de, … |
LQD1…LQDn | Multi-Word Quantifiers | uns quantos, … |
LREL1…LRELn | Multi-Word Relatives | tal como, … |
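
The positional tags in the Multi-Word Expressions rows (L, then the base tag, then a position number) make it possible to reassemble an expression from its tokens. The following is a sketch under assumptions, not a definitive implementation: MultiWordGrouper is a hypothetical name that is not part of LXClient, the input is a list of token/TAG items, and a new expression is assumed to begin at position 1 or whenever the base tag changes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of LXClient): regroups multi-word expressions. */
final class MultiWordGrouper {
    // L + base tag + 1-based position, e.g. LCJ3 gives base "CJ" and position "3".
    private static final Pattern MWE_TAG = Pattern.compile("L([A-Z]+)(\\d+)");

    /** Takes token/TAG items and merges each multi-word run into "words/BASE". */
    static List<String> group(List<String> items) {
        List<String> out = new ArrayList<>();
        StringBuilder words = new StringBuilder(); // tokens of the open expression
        String base = null;                        // its base tag, or null if none is open
        for (String item : items) {
            int slash = item.lastIndexOf('/');
            Matcher m = MWE_TAG.matcher(slash < 0 ? "" : item.substring(slash + 1));
            if (!m.matches()) {
                // Ordinary item: close any open expression, then pass it through.
                if (base != null) {
                    out.add(words + "/" + base);
                    base = null;
                }
                out.add(item);
                continue;
            }
            boolean startsNew = base == null
                    || m.group(2).equals("1")
                    || !m.group(1).equals(base);
            if (startsNew) {
                if (base != null) {
                    out.add(words + "/" + base);
                }
                words.setLength(0);
                base = m.group(1);
            } else {
                words.append(' ');
            }
            words.append(item, 0, slash); // the token part, without the tag
        }
        if (base != null) {
            out.add(words + "/" + base);
        }
        return out;
    }

    public static void main(String[] args) {
        // Hand-written sample in the documented notation; not real service output.
        System.out.println(group(List.of(
                "de/LCJ1", "maneira/LCJ2", "a/LCJ3", "que/LCJ4", "um/IA", "exemplo/CN")));
    }
}
```

Single-token tags that merely start with L, such as LTR, are left alone because they carry no position number.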

Tag | Description |
---|---|
m | Masculine |
f | Feminine |
s | Singular |
p | Plural |
dim | Diminutive |
sup | Superlative |
comp | Comparative |
1 | First Person |
2 | Second Person |
3 | Third Person |
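pi | Present Indicative (Presente do Indicativo) |
ppi | Preterite Indicative (Pretérito Perfeito do Indicativo) |
ii | Imperfect Indicative (Pretérito Imperfeito do Indicativo) |
mpi | Pluperfect Indicative (Pretérito Mais que Perfeito do Indicativo) |
fi | Future Indicative (Futuro do Indicativo) |
c | Conditional (Condicional) |
pc | Present Subjunctive (Presente do Conjuntivo) |
ic | Imperfect Subjunctive (Pretérito Imperfeito do Conjuntivo) |
fc | Future Subjunctive (Futuro do Conjuntivo) |
imp | Imperative (Imperativo) |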

Parameters:
text - the text to be POS tagged: raw text of Portuguese (max size 10K characters).
Throws:
LXException - if an error occurs.
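
Because each method accepts at most 10K characters, longer input has to be divided before it is submitted. The sketch below shows one possible way to do that, cutting after a line break where one is available. InputSplitter is a hypothetical helper that is not part of LXClient, and reading "10K" as 10,000 characters is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of LXClient): keeps input under the size limit. */
final class InputSplitter {
    /** Assumed reading of the documented "10K characters" limit. */
    static final int MAX_CHARS = 10_000;

    /** Splits text into pieces of at most maxChars, ending pieces after a line break where possible. */
    static List<String> split(String text, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (text.length() - start > maxChars) {
            int limit = start + maxChars;
            // Last line break inside the window; the piece ends just after it.
            int cut = text.lastIndexOf('\n', limit - 1);
            // With no line break in the window, fall back to a hard cut at the limit.
            int end = (cut >= start) ? cut + 1 : limit;
            pieces.add(text.substring(start, end));
            start = end;
        }
        if (start < text.length()) {
            pieces.add(text.substring(start));
        }
        return pieces;
    }

    public static void main(String[] args) {
        // A small limit is used here only to make the behaviour visible.
        System.out.println(split("aaa\nbbb\nccc", 5));
    }
}
```

Each piece could then be passed to chunks, tokenizes or posTags in turn. The hard-cut fallback can split a word in two, so text without line breaks may need a more careful boundary.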