# AkiliCode-14B Research Preview
AkiliCode-14B is an early coding model from MsingiAI, released as a research preview.
This release is published as `msingiai/akilicode-14b-research-preview`.

- Release artifact directory: `/outputs/akilicode-14b-research-preview`
- Source checkpoint: `/outputs/akili-code-stage3-ckpt100-s50-r4/checkpoint-20`
It is a post-trained Qwen2.5-Coder-14B model optimized for structured reasoning and repair-oriented coding tasks. In MsingiAI's current internal evaluation stack, it shows promising function-level coding performance, but it is not yet strong on hidden-test competitive-programming benchmarks.
## Research Preview Status
This model is being released for:
- research
- evaluation
- failure analysis
- downstream experimentation
This model is not being released as a state-of-the-art coding model or as a strong LiveCodeBench model.
## Key Metrics
Promoted checkpoint results:
| Benchmark | Result |
|---|---|
| HumanEval+ | 62.80 |
| MBPP+ | 65.61 |
| BigCodeBench-Instruct | 45.09 |
| CRUXEval-O | 49.75 |
| LiveCodeBench v6 official | 11.37 |
## Important Caveat on LiveCodeBench
The main remaining weakness is hidden-test algorithmic correctness, not output parsing.
On the official-style LiveCodeBench v6 run:
- n = 1055
- pass@1 = 11.37
- private tests used for all 1055 problems
- extraction_success_rate = 100.0
- syntax_valid_rate = 98.58
Failure breakdown:
- wrong_answer = 838
- timeout = 61
- runtime_error = 21
- syntax_error = 15
- extraction_failed = 0
This means the model is usually producing executable outputs, but it still struggles on hidden-test algorithmic generalization, especially on medium and hard competition-style problems.
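These numbers are internally consistent; a quick back-of-the-envelope check (ours, not part of any official harness) confirms the breakdown matches the reported pass@1:

```python
# Sanity check: the failure breakdown should account for the reported pass@1.
n = 1055
failed = 838 + 61 + 21 + 15   # wrong_answer + timeout + runtime_error + syntax_error
solved = n - failed           # 120 problems pass all private tests
print(f"pass@1 = {100 * solved / n:.2f}%")  # -> pass@1 = 11.37%
```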
## Intended Output Format
AkiliCode-14B was trained to respond with two XML blocks, in order: `<reasoning>...</reasoning>` followed by `<code>...</code>`.
The reasoning block is expected to contain these headings: `PLAN:`, `TRACE:`, `EDGE CASES:`, `COMPLEXITY:`.
The code block is expected to contain only executable Python.
If your downstream stack only wants runnable code, extract the contents of <code>...</code> before execution.
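A minimal extraction sketch (the regex and helper name here are illustrative, not part of the model's tooling):

```python
import re

def extract_code(completion: str) -> str | None:
    """Return the payload of the first <code>...</code> block, if any."""
    match = re.search(r"<code>(.*?)</code>", completion, re.DOTALL)
    return match.group(1).strip() if match else None

# Example shape of a model completion:
completion = (
    "<reasoning>PLAN: expand around each center...\n"
    "TRACE: ...\nEDGE CASES: ...\nCOMPLEXITY: O(n^2)</reasoning>\n"
    "<code>def longest_palindrome(s): ...</code>"
)
print(extract_code(completion))  # -> def longest_palindrome(s): ...
```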
## Intended Uses
Recommended uses:
- structured code generation
- code-repair experiments
- benchmark research
- reasoning-format experiments
- evaluation harness development
Less suitable uses right now:
- competition-style hidden-test programming
- production-critical autonomous coding
- benchmark marketing claims about frontier coding performance
## How to Use
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "msingiai/akilicode-14b-research-preview"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# ChatML-style prompt; the system message pins the <reasoning>/<code> output format.
prompt = """<|im_start|>system
You are Akili Code, an expert programming assistant built by MsingiAI.
Always respond with <reasoning> followed by <code>.<|im_end|>
<|im_start|>user
Write a Python function that returns the longest palindromic substring of a string.<|im_end|>
<|im_start|>assistant
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
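If you would rather not hand-write the ChatML markers, the tokenizer's chat template should produce an equivalent prompt (a sketch assuming the checkpoint inherits Qwen2.5-Coder's standard chat template):

```python
# Same request via the tokenizer's chat template (assumption: the checkpoint
# ships Qwen2.5-Coder's ChatML chat template).
messages = [
    {"role": "system", "content": "You are Akili Code, an expert programming assistant "
                                  "built by MsingiAI. Always respond with <reasoning> followed by <code>."},
    {"role": "user", "content": "Write a Python function that returns the longest "
                                "palindromic substring of a string."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=1024, do_sample=False)
print(tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=True))
```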
## Training Summary
AkiliCode-14B uses a three-stage post-training recipe:
- supervised fine-tuning
- white-box RL for first-pass execution accuracy
- repair-focused MURPHY / P-GRPO continuation
The promoted checkpoint was selected because it improved the strongest reliable function-level metrics relative to the Stage 2 golden checkpoint:
- HumanEval+: 60.98 -> 62.80
- MBPP+: 65.08 -> 65.61
It regressed slightly on BigCodeBench-Instruct (45.61 -> 45.09).
## Limitations
- weak performance on hidden-test competitive programming
- current benchmark profile is much stronger on short function-synthesis tasks than contest-style algorithmic tasks
- model behavior depends on downstream handling of the XML output format
- not validated for safety-critical or production-critical use
## Contact
For research or partnership inquiries:
korir@msingiai.com