Natural Language in a Nutshell

Rupomcse23
3 min read · Apr 17, 2023

Language:

  • Language is a structured system of communication,
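  • Natural languages are primarily spoken,
  • They can be encoded into secondary media: writing, whistling, signing, braille, etc.,
  • There are over 5,000 human languages (commonly cited estimates run to about 7,000),
  • Natural languages evolve among humans through use, without deliberate design,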
  • Formal Languages are created with well-defined syntax and rules to serve a specific purpose,
  • e.g., Computer programming languages.

Spoken Language vs. Written Language:

  • Spoken languages are far older than writing,
  • Written languages developed only a few thousand years ago,
  • Spoken languages have many nuances that cannot be represented in writing,
  • Speech combines words with facial expressions, gestures, and changes in how sounds are produced (e.g., tone, pitch, and pace),
  • These are called paralinguistic features.

Figure: Spoken Language vs. Written Language

Linguistics

  • Linguistics is the study of human language,
  • Language consists of many components: characters, words, sentences, etc.
  • Human Language is composed of four main building blocks:
  • Phonemes,
  • Morphemes/Lexemes,
  • Syntax, and
  • Context/Pragmatics.

Phonemes

  • Phonemes are the smallest units of sound in a language,
  • On their own they may carry no meaning, but they distinguish and build meaning when combined (e.g., "cat" vs. "bat"),
  • Useful for speech recognition, speech-to-text transcription, and text-to-speech conversion.
Figure: Phonemes in the English language
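
To make the idea that phonemes distinguish meaning more concrete, here is a minimal sketch that finds "minimal pairs", i.e. words whose pronunciations differ in exactly one phoneme. The pronunciation table is hand-written for this example with ARPAbet-style symbols (an assumption, not data from a real pronouncing dictionary), and `is_minimal_pair` and `minimal_pairs` are hypothetical helper names.

```python
from itertools import combinations

# Hand-written, illustrative pronunciations (ARPAbet-style symbols).
# This table is an assumption made for the example, not real dictionary data.
PRONUNCIATIONS = {
    "cat": ["K", "AE", "T"],
    "bat": ["B", "AE", "T"],
    "bit": ["B", "IH", "T"],
    "cab": ["K", "AE", "B"],
}


def is_minimal_pair(a, b):
    """Return True if two phoneme sequences have equal length
    and differ in exactly one position."""
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) == 1


def minimal_pairs(pronunciations):
    """Return all word pairs (in alphabetical order) that form a minimal pair."""
    return [
        (w1, w2)
        for w1, w2 in combinations(sorted(pronunciations), 2)
        if is_minimal_pair(pronunciations[w1], pronunciations[w2])
    ]


if __name__ == "__main__":
    for w1, w2 in minimal_pairs(PRONUNCIATIONS):
        print(w1, "/", w2)
```

Swapping a single phoneme (K for B in "cat" and "bat") is enough to produce a different word, which is why speech systems typically model language at this level.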

Morphemes/Lexemes

  • A Morpheme is the smallest unit of language that has a meaning,
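  • It consists of a combination of phonemes,
  • E.g., "running" is made of two morphemes: "run" and "-ing",
  • A lexeme is the abstract unit of meaning shared by a set of related word forms,
  • E.g., run, runs, ran and running are forms of the same lexeme “run.”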
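
The "run" example can be sketched in a few lines of code. This is only a toy illustration, assuming a hand-written lookup table (`LEMMA_TABLE`) and hypothetical helpers `lemma` and `group_by_lexeme`; real lemmatizers rely on much larger dictionaries and morphological rules.

```python
# Hypothetical, hand-written table mapping inflected forms to their lexeme.
# Irregular forms such as "ran" cannot be derived by stripping a suffix,
# which is why a lookup is used here.
LEMMA_TABLE = {
    "run": "run",
    "runs": "run",
    "ran": "run",
    "running": "run",
}


def lemma(word):
    """Return the lexeme for a word, or the lowercased word itself if unknown."""
    w = word.lower()
    return LEMMA_TABLE.get(w, w)


def group_by_lexeme(words):
    """Group word forms under the lexeme they belong to."""
    groups = {}
    for word in words:
        groups.setdefault(lemma(word), []).append(word)
    return groups


if __name__ == "__main__":
    print(group_by_lexeme(["run", "runs", "ran", "running", "walk"]))
```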

Syntax

  • Syntax is a set of rules for constructing grammatically correct sentences from words and phrases,
  • It describes the hierarchical structure of language: words at the bottom, then parts of speech, then phrases, and finally the sentence,
  • Parsing is the NLP task of constructing these trees automatically; tasks such as entity and relation extraction build on this structure.

Syntax: Phrase structure trees

Example: the phrase "A Test of Natural Language Processing" is tagged as follows:

  • Art: A
  • N: Test
  • Prep: of
  • N: Natural Language Processing

Figure: Phrase structure trees

Abbreviations used in phrase structure trees:

  • S: Sentence
  • NP: Noun Phrase
  • VP: Verb Phrase
  • Art: Article
  • AP: Adjective Phrase
  • N: Noun
  • PP: Prepositional Phrase
  • Aux: Auxiliary (helping) verb
  • P (or Prep): Preposition
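
A phrase structure tree can be represented in code as nested tuples. The sketch below encodes the tagged example "A Test of Natural Language Processing"; the exact bracketing (a noun phrase containing a prepositional phrase) is an assumption made for illustration, and `leaves` and `bracketed` are hypothetical helper names.

```python
# Each node is a tuple: (label, child, child, ...).
# A leaf node is (part_of_speech, word). The structure below is an
# assumed analysis of the example phrase, written by hand.
TREE = (
    "NP",
    ("Art", "A"),
    ("N", "Test"),
    ("PP",
        ("Prep", "of"),
        ("NP", ("N", "Natural Language Processing"))),
)


def is_leaf(tree):
    """A leaf has exactly one child, and that child is a plain string."""
    return len(tree) == 2 and isinstance(tree[1], str)


def leaves(tree):
    """Collect the words at the bottom of the tree, left to right."""
    if is_leaf(tree):
        return [tree[1]]
    words = []
    for child in tree[1:]:
        words.extend(leaves(child))
    return words


def bracketed(tree):
    """Render the tree in the bracketed notation commonly used by parsers."""
    if is_leaf(tree):
        return f"({tree[0]} {tree[1]})"
    return f"({tree[0]} " + " ".join(bracketed(c) for c in tree[1:]) + ")"


if __name__ == "__main__":
    print(leaves(TREE))
    print(bracketed(TREE))
```

Reading the leaves from left to right recovers the original phrase, while the nesting records which words group into which phrases.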

Context

  • Context helps convey a specific meaning,
  • Semantics is the direct meaning of words and sentences,
  • Pragmatics adds world knowledge, common sense and external context,
  • Complex NLP tasks that depend on context include sarcasm detection, summarization, and topic modeling.

Sociolinguistics

  • Sociolinguistics is the study of language in society,
  • It is an interdisciplinary field combining sociology and linguistics,
  • It studies language varieties, including dialects and slang,
  • Sociolect: A social dialect, i.e. a variety associated with a particular social group,
  • Idiolect: The collection of varieties that an individual speaks,
  • Registers: The way a speaker uses language differently in different circumstances.
  • Determined by social occasions, context, purpose, and audience.

Writing Systems

  • Alphabets: Phonetic-based writing systems that represent both consonants and vowels
  • E.g., Latin, Cyrillic, Greek
  • Abjads: Phonetic-based writing systems that represent mostly consonants, with vowels optional. Typically written right to left
  • E.g., Arabic, Hebrew
  • Abugidas: Each character represents a consonant with an inherent vowel; other vowels are marked by modifying the character
  • E.g., Devanagari, Thai
  • Syllabaries: Phonetic systems in which each symbol represents a whole syllable
  • E.g., Hiragana, Cherokee (Tsalagi, a Native American language)
  • Logographs: Characters represent words or morphemes, often combining semantic and phonetic components
  • E.g., Chinese characters (Han)
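
As a rough illustration, Python's standard `unicodedata` module can show which script a character comes from. The sketch below uses the first word of each character's official Unicode name as a hint; this is a simplification (it is not the Unicode Script property, which the standard library does not expose), and `script_hint` and the sample characters are choices made for this example.

```python
import unicodedata

# One sample character per writing system mentioned above,
# written as escapes so the file stays pure ASCII.
SAMPLES = {
    "Latin (alphabet)": "a",
    "Cyrillic (alphabet)": "\u0434",
    "Greek (alphabet)": "\u03b1",
    "Arabic (abjad)": "\u0628",
    "Hebrew (abjad)": "\u05d0",
    "Devanagari (abugida)": "\u0915",
    "Thai (abugida)": "\u0e01",
    "Hiragana (syllabary)": "\u3042",
    "Cherokee (syllabary)": "\u13a0",
    "Han (logographic)": "\u4e2d",
}


def script_hint(ch):
    """Return the first word of the character's Unicode name,
    which usually names its script (e.g. 'LATIN', 'CYRILLIC')."""
    return unicodedata.name(ch).split()[0]


if __name__ == "__main__":
    for label, ch in SAMPLES.items():
        # Print code points rather than the characters themselves
        # so the output works on any terminal encoding.
        print(f"{label:22} U+{ord(ch):04X}  {script_hint(ch)}")
```

Han characters show up as "CJK" because Unicode unifies the ideographs shared by Chinese, Japanese, and Korean.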

Encodings

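  • ASCII: American Standard Code for Information Interchange
  • Covers control characters plus basic English letters, digits, and punctuation
  • 128 characters in total (7 bits per character)
Figure: ASCII chart
  • Unicode: The Unicode Standard
  • A standardized set of character-number mappings maintained by the Unicode Consortium
  • UTF-8: A variable-width encoding of Unicode that uses 1 to 4 bytes per character and is backward compatible with ASCII
Figure: Unicode chart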
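
The difference between ASCII and UTF-8 is easy to see with Python's built-in `str.encode`. This is a minimal sketch; `utf8_width` and `is_ascii` are hypothetical helper names, and the sample characters are chosen only for illustration.

```python
def utf8_width(ch):
    """Number of bytes UTF-8 needs to encode a single character."""
    return len(ch.encode("utf-8"))


def is_ascii(text):
    """True if every character fits in ASCII's 128 code points (0-127)."""
    return all(ord(c) < 128 for c in text)


# Sample characters, written as escapes so the file stays pure ASCII:
# Latin capital A, e with acute accent, a Han character, an emoji.
SAMPLES = ["A", "\u00e9", "\u4e2d", "\U0001F600"]

if __name__ == "__main__":
    for ch in SAMPLES:
        print(f"U+{ord(ch):04X}: {utf8_width(ch)} byte(s), ascii={is_ascii(ch)}")

    # For ASCII characters, the ASCII and UTF-8 encodings are identical.
    print("A".encode("ascii") == "A".encode("utf-8"))
```

Characters in the ASCII range take one byte in UTF-8, while characters further up the Unicode range take two, three, or four bytes.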
