
Content Analysis Pipeline

Analyze text content to extract metrics and detect patterns, a common foundation for content moderation, SEO analysis, and feature extraction.

Use Case

Multiple Applications

Content analysis pipelines serve content moderation (keyword detection), SEO analysis (content length, keyword density), and ML feature engineering (text statistics).

Given a piece of text content, you want to:

  • Count words and characters
  • Check for specific keywords
  • Split into processable lines

The Pipeline

# content-analysis.cst
# Analyze text content for metrics and keywords

in content: String
in keyword: String

# Basic metrics
wordCount = WordCount(content)
charCount = TextLength(content)

# Keyword detection
containsKeyword = Contains(content, keyword)

# Normalize for analysis
normalized = Lowercase(content)

# Output analysis results
out wordCount
out charCount
out containsKeyword
out normalized

Explanation

Function     Description                              Return Type
WordCount    Counts words (space-separated tokens)    Int
TextLength   Counts characters in a string            Int
Contains     Checks whether a substring exists        Boolean
Lowercase    Converts a string to lowercase           String

Running the Example

Input

{
  "content": "Hello World! This is a Test Document.",
  "keyword": "Test"
}

Expected Output

{
  "wordCount": 7,
  "charCount": 37,
  "containsKeyword": true,
  "normalized": "hello world! this is a test document."
}
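
Because Contains is case-sensitive by default, changing the keyword's case flips the result. With the same content but "keyword": "test" (lowercase), the pipeline would instead report:

{
  "containsKeyword": false
}

Checking against the normalized content would still match, which is why the Best Practices section recommends normalizing both sides.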

Variations

Multi-line Content Analysis

Process content that spans multiple lines:

in content: String

# Split into lines
lines = SplitLines(content)

# Get line count
lineCount = list-length(lines)

# Get first line for preview
firstLine = list-first(lines)

out lineCount
out firstLine
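
For example, given three-line input (assuming SplitLines splits on newline characters, as the name suggests):

Input:

{
  "content": "first line\nsecond line\nthird line"
}

Expected Output:

{
  "lineCount": 3,
  "firstLine": "first line"
}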

CSV-like Data Parsing

Not a Full CSV Parser

This pattern handles simple delimited data. For complex CSV with quoted fields, escaping, or multi-line values, use a dedicated parsing module.

Parse delimited data:

in row: String
in delimiter: String

# Split by delimiter
fields = Split(row, delimiter)

# Get field count
fieldCount = list-length(fields)

out fields
out fieldCount

Input:

{
  "row": "John,Doe,john@example.com,42",
  "delimiter": ","
}
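
Assuming Split returns a list of strings (it does not parse numbers), the example row would produce:

Expected Output:

{
  "fields": ["John", "Doe", "john@example.com", "42"],
  "fieldCount": 4
}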

Content Quality Score

Combine metrics into a quality assessment:

use stdlib.math
use stdlib.comparison

in content: String

wordCount = WordCount(content)
charCount = TextLength(content)

# Check minimum thresholds
hasMinWords = gte(wordCount, 10)
hasMinChars = gte(charCount, 50)

out wordCount
out charCount
out hasMinWords
out hasMinChars
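
For instance, a too-short input fails both thresholds (word and character counts follow the definitions in the function table above):

Input:

{
  "content": "Too short"
}

Expected Output:

{
  "wordCount": 2,
  "charCount": 9,
  "hasMinWords": false,
  "hasMinChars": false
}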

Keyword Density Analysis

Check for a keyword both case-sensitively and case-insensitively, a building block for density analysis:

in content: String
in keyword: String

# Get content metrics
contentWords = WordCount(content)
contentLength = TextLength(content)

# Check keyword presence
hasKeyword = Contains(content, keyword)

# Normalize for case-insensitive check
normalizedContent = Lowercase(content)
normalizedKeyword = Lowercase(keyword)
hasKeywordNormalized = Contains(normalizedContent, normalizedKeyword)

out hasKeyword
out hasKeywordNormalized
out contentWords
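
For example, a keyword that differs from the content only in case is caught by the normalized check but missed by the direct one:

Input:

{
  "content": "Machine learning is fun",
  "keyword": "LEARNING"
}

Expected Output:

{
  "hasKeyword": false,
  "hasKeywordNormalized": true,
  "contentWords": 4
}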

Real-World Applications

Content Moderation

  • Check for banned keywords
  • Validate minimum content length
  • Extract content for review

SEO Analysis

  • Count keyword occurrences
  • Measure content length
  • Analyze content structure

Data Extraction

  • Parse structured text (CSV, logs)
  • Split multi-line content
  • Extract specific fields

Feature Engineering

  • Generate text features for ML models
  • Normalize text for comparison
  • Create content fingerprints

Best Practices

Case Sensitivity

Keyword detection is case-sensitive by default. When you need case-insensitive matching, normalize both the content and the keyword to the same case before calling Contains.

  1. Normalize before comparing: Use Lowercase for case-insensitive matching
  2. Trim inputs: Clean whitespace before analysis with Trim
  3. Handle edge cases: Empty strings, single words, etc.
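
Combining points 1 and 2, a minimal cleanup stage might look like this (a sketch, assuming Trim behaves like the other string functions shown above and is chained through an intermediate binding):

in content: String

# Remove surrounding whitespace, then normalize case
trimmed = Trim(content)
normalized = Lowercase(trimmed)

out normalized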