Shuu12121/Owl-ph2-len2048 🦉

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: Shuu12121/Owl-ph2-base-len2048
  • Maximum Sequence Length: 1024 tokens (2048 tokens during pretraining)
  • Output Dimensionality: 768
  • Similarity Function: Cosine Similarity

This model is a SentenceTransformer variant of Shuu12121/Owl-ph2-base-len2048. It was trained on the Owl corpus for code search and code-text retrieval. The training data consists of roughly 100,000 samples per language (800,640 pairs in total), and the model was trained for 1 epoch with a learning rate of 1e-5.

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 1024, 'do_lower_case': False, 'architecture': 'ModernBertModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
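
In the Pooling module above, only `pooling_mode_cls_token` is enabled, so the sentence embedding is the final hidden state of the first token rather than an average over all tokens. The following is a minimal NumPy sketch of that selection step. It uses a random array in place of real transformer output, and the function name `cls_pool` is illustrative, not part of the library.

```python
import numpy as np


def cls_pool(token_embeddings: np.ndarray) -> np.ndarray:
    """Return the first-token vector for each sequence.

    token_embeddings has shape (batch, seq_len, hidden);
    the result has shape (batch, hidden).
    """
    return token_embeddings[:, 0, :]


# Toy stand-in for transformer output: 2 sequences, 4 tokens, hidden size 768.
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(2, 4, 768))
sentence_embeddings = cls_pool(hidden_states)
print(sentence_embeddings.shape)  # (2, 768)
```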

Intended Uses

This model is intended for:

  • code search
  • code-text retrieval
  • semantic similarity
  • dense embedding generation for source code and natural language

Usage

Direct Usage (Sentence Transformers)

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Shuu12121/Owl-ph2-len2048")
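
Once the model is loaded, a typical retrieval step is to encode a natural language query and a set of code snippets, then rank the snippets by cosine similarity. The sketch below is one way to do that. The query, the snippets, and the helper names `rank_by_cosine` and `demo` are made up for illustration; only `SentenceTransformer` and `encode` come from the library. `demo()` is defined but not called, since it downloads the model on first use.

```python
import numpy as np


def rank_by_cosine(query_vec, doc_vecs):
    """Return (indices sorted best-first, cosine scores) for one query."""
    q = np.asarray(query_vec, dtype=np.float64)
    d = np.asarray(doc_vecs, dtype=np.float64)
    q = q / np.linalg.norm(q)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    scores = d @ q
    return np.argsort(-scores), scores


def demo():
    # Imported here so rank_by_cosine can be used without the library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("Shuu12121/Owl-ph2-len2048")
    query = "read a JSON file and return its contents"
    snippets = [  # example inputs, not taken from the model card
        "def load_json(path):\n    with open(path) as f:\n        return json.load(f)",
        "def add(a, b):\n    return a + b",
    ]
    query_emb = model.encode(query)        # NumPy array, shape (768,)
    snippet_embs = model.encode(snippets)  # NumPy array, shape (2, 768)
    order, scores = rank_by_cosine(query_emb, snippet_embs)
    for i in order:
        print(f"{scores[i]:.3f}  {snippets[i].splitlines()[0]}")
```

Calling `demo()` prints each snippet's first line with its score, best match first. Inputs longer than the 1024-token maximum sequence length are truncated by the model.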

Training Details

Training Dataset

This model was trained on the Owl corpus, a dataset constructed for code search and code-text retrieval. The training set contains approximately 100,000 samples per language, resulting in 800,640 training pairs in total.

Training Hyperparameters

  • Learning rate: 1e-5
  • Epochs: 1
  • Loss: MultipleNegativesRankingLoss
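
MultipleNegativesRankingLoss trains on (text, code) pairs by treating every other pair's code in the same batch as a negative, then applying cross-entropy over the scaled similarity matrix. The NumPy sketch below illustrates the idea only. It is not the library's implementation, and the `scale=20.0` value mirrors the Sentence Transformers default, which is an assumption here because this card does not state the value used in training.

```python
import numpy as np


def mnr_loss(anchors, positives, scale=20.0):
    """In-batch-negatives ranking loss over cosine similarities.

    anchors and positives both have shape (batch, dim); row i of
    positives is the correct match for row i of anchors.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)                            # (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct "class" for row i is column i, i.e. the diagonal.
    return float(-np.mean(np.diag(log_probs)))


# Perfectly aligned pairs give a loss near zero; mismatched pairs do not.
aligned = np.eye(4)
mismatched = np.roll(aligned, 1, axis=0)
print(mnr_loss(aligned, aligned) < mnr_loss(aligned, mismatched))  # True
```

Because the negatives come from the batch itself, larger batches give each pair more negatives to contrast against.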

Integrations

Owl-CLI

This model is used as the embedding model in Owl-CLI, a command-line tool for semantic code search.

Owl-CLI indexes source code at the function level, generates dense embeddings using this model, and performs vector similarity search to retrieve relevant code for natural language queries.
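
The pipeline described above (split code into functions, embed each one, search by vector similarity) can be sketched as follows. This is not Owl-CLI's actual implementation: it handles Python sources only, uses the standard library `ast` module for function extraction, and the names `FunctionRecord`, `extract_functions`, and `search` are hypothetical.

```python
import ast
from dataclasses import dataclass

import numpy as np


@dataclass
class FunctionRecord:
    path: str
    name: str
    start_line: int
    end_line: int
    source: str


def extract_functions(path, source):
    """Collect every function definition in a Python source string."""
    records = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            records.append(FunctionRecord(
                path=path,
                name=node.name,
                start_line=node.lineno,
                end_line=node.end_lineno,
                source=ast.get_source_segment(source, node) or "",
            ))
    return sorted(records, key=lambda r: r.start_line)


def search(query_vec, records, record_vecs, top_k=3):
    """Return the top_k (score, record) pairs by cosine similarity."""
    q = np.asarray(query_vec, dtype=np.float64)
    m = np.asarray(record_vecs, dtype=np.float64)
    scores = (m @ q) / (np.linalg.norm(m, axis=1) * np.linalg.norm(q))
    best = np.argsort(-scores)[:top_k]
    return [(float(scores[i]), records[i]) for i in best]


# With the real model, the vectors would come from something like:
#   record_vecs = model.encode([r.source for r in records])
#   query_vec = model.encode("add two numbers")
```

Each hit carries the file path and line range of the matched function, which is the information a caller needs to jump to the code.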

Key features of Owl-CLI include:

  • Semantic code search using dense embeddings
  • Function-level indexing with file paths and line numbers
  • Automatic indexing on first search
  • Differential embedding cache to avoid re-embedding unchanged files
  • JSON output for tool integration
  • MCP server support for integration with AI coding agents (e.g., Claude Code)
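
To illustrate two of the features listed above, here is a rough sketch of a differential embedding cache and of JSON-formatted results. How Owl-CLI detects unchanged files is not described here, so the SHA-256 content hash, the per-file granularity, the cache layout, and the JSON field names are all assumptions made for this example.

```python
import hashlib
import json


def content_hash(text):
    """Hex digest used to detect whether a file's content has changed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def update_cache(files, cache, embed):
    """Re-embed only new or modified files.

    files: dict mapping path -> current file content
    cache: dict mapping path -> {"hash": str, "embedding": list}
    embed: callable turning text into an embedding (e.g. model.encode)
    Returns the list of paths that were (re-)embedded.
    """
    changed = []
    for path, text in files.items():
        digest = content_hash(text)
        entry = cache.get(path)
        if entry is None or entry["hash"] != digest:
            cache[path] = {"hash": digest, "embedding": embed(text)}
            changed.append(path)
    # Drop cache entries for files that no longer exist.
    for path in list(cache):
        if path not in files:
            del cache[path]
    return changed


def format_results(hits):
    """Serialize (score, path, line) tuples as a JSON array of objects."""
    return json.dumps(
        [{"score": score, "path": path, "line": line} for score, path, line in hits]
    )
```

On a second run over an unchanged tree, `update_cache` returns an empty list and `embed` is never called, which is the saving a differential cache is meant to provide.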

Repository:
https://github.com/Shun0212/Owl-CLI
