Skip to content

Spreadsheet parser consolidation (temporary planning doc for PR #2592)

Temporary planning document. Delete once the consolidation in this PR (or the follow-up PR it scopes out) has landed and a decision record has been captured in /docs/decisions/.

Why this document exists

PR #2591's preview-environment testing surfaced an import failure (issue narrative) whose root cause is that the screening-value tolerance introduced by PR #2509 (shipped in the April 2026 user-guide release as "more flexible screening values") was applied to one of three parser paths. The other two paths still call (ScreeningDecision) int.Parse(columnStringValue), which throws FormatException on any non-numeric input. Rather than silently fan the fix out to the two laggard paths, this doc names the duplication so we can decide whether to consolidate now or later.

User-facing symptom

On the syrf_pr_2591 preview database, a systematic-search upload named TestWithSecreening to the "Screening In Progress" project transitioned to the Error state at 2026-04-24T16:21:08. The PM service log shows:

System.FormatException: The input string 'Include' was not in a correct format.
  at System.Int32.Parse(String s)
  at SpreadsheetParseImplementation.<>c__DisplayClass22_1.<ParseStudiesAsync>b__2
    (in SpreadsheetParseImplementation.cs:line 168)

The user-facing impression: "your fix for tolerant screening values isn't actually working on this branch." Reality: the fix never landed on the search-import code path in the first place.

The three parsers

All three live under src/libs/project-management/SyRF.ProjectManagement.Core/Services/ParserImplementations/. They all contain a ScreeningImportSettings?.UserColumnMap.ForEach(...) block that interprets a per-row column value as a ScreeningDecision.

1. SpreadsheetParseImplementation.cs — legacy search-import

  • Directory: ParserImplementations/ (alongside Endnote/Pubmed XML parsers).
  • Interface: IStudyParseImplementation (resolved by ParserRegistry on the basis of LibraryType).
  • Used by: StudyReferenceFileParser.ParseStudiesAsync — the production code path for systematic-search CSV imports (RIS, SyRF library, EndNote export, etc.).
  • Screening-value parse: (ScreeningDecision) int.Parse(columnStringValue) at line 170. Throws on non-numeric input. This is the code path the user hit.
  • Status in this PR: to be converted to ScreeningDecisionParser.TryParse.

2. NewParser/StudySpreadsheetRecordProcessor.cs — dead code

  • Directory: ParserImplementations/NewParser/.
  • Interface: IRecordProcessor<Study> (new-style: reader + processor split via NewSpreadsheetReader).
  • Used by: nothing in production. The only callers outside the file itself are its own test suite (StudySpreadsheetRecordProcessorTests).
  • Screening-value parse: (ScreeningDecision) int.Parse(GetValueFromColumn(recordFields, u.Key)) at line 166. Same bug, just not reachable from production.
  • Status in this PR: to be converted to ScreeningDecisionParser.TryParse as well, so that if/when we wire it up it inherits the tolerant behaviour. (Not deleting the file here — see "Consolidation options" below.)

3. NewParser/StudyUpdateRecordProcessor.cs — bulk-study-update

  • Directory: ParserImplementations/NewParser/.
  • Interface: IRecordProcessor<StudyUpdate>.
  • Used by: StudyReferenceFileParser.ProcessStreamToUpdates — the production code path for bulk study update CSVs.
  • Screening-value parse: already uses ScreeningDecisionParser.TryParse since PR #2509 (2026-04-07). Surfaces a structured "accepted values" error instead of an unhandled exception.
  • Status in this PR: unchanged (reference implementation).

Why there are two "spreadsheet" parsers

The NewParser/ directory is an in-flight but incomplete migration from a monolithic IStudyParseImplementation per file format (spreadsheet/endnote/pubmed) to a cleaner reader-plus-processor architecture (IRecordReader<TIn> + IRecordProcessor<TIn, TOut>). Evidence:

  • The new architecture only has a production consumer for the StudyUpdate processor (the bulk-update flow, added as part of that feature).
  • StudySpreadsheetRecordProcessor and StudyEndnoteRecordProcessor were written as planned replacements for SpreadsheetParseImplementation and EndnoteXmlParseImplementation, and both have their own test suites — but StudyReferenceFileParser.ParseStudiesAsync still uses the legacy IStudyParseImplementation collection via .First(spi => spi.HandlesTypes.ContainsKey(searchImportJob.LibraryType)).
  • Several static field references in StudySpreadsheetRecordProcessor reach back into the legacy file (SpreadsheetParseImplementation.ReferenceType, SpreadsheetParseImplementation.PublicationName, etc.) because the header constants were never extracted into a shared SpreadsheetFieldNames class.

So today the picture is:

 Search import ──► IStudyParseImplementation ──► SpreadsheetParseImplementation
                                            └──► EndnoteXmlParseImplementation
                                            └──► PubmedXmlParseImplementation
 Bulk update   ──► NewSpreadsheetReader ──► StudyUpdateRecordProcessor

 [Orphan]          NewSpreadsheetReader ──► StudySpreadsheetRecordProcessor
 [Orphan]          NewEndnoteReader     ──► StudyEndnoteRecordProcessor

Consolidation options

Option A — minimal: fan out the tolerant parser, leave structure alone (this PR)

  • Replace int.Parse with ScreeningDecisionParser.TryParse in both SpreadsheetParseImplementation.cs and StudySpreadsheetRecordProcessor.cs.
  • Add per-row ParseError entries (with ScreeningDecisionParser.AcceptedValuesDescription in the message) when a value can't be parsed, matching the pattern StudyUpdateRecordProcessor already uses.
  • Extend tests in SpreadsheetParseImplementationTests.cs and StudySpreadsheetRecordProcessorTests.cs to cover the same screening-value matrix the bulk-update tests already cover.
  • Update the April 2026 changelog on pr2567 to note that tolerant screening values now apply to search-library uploads as well as bulk updates.

Pros: closes the user-facing bug in a single small PR; easy to review; low regression risk. Cons: the three copies of the same per-row screening block remain in three files.

Option B — full migration (follow-up PR, scoped here)

  • Migrate StudyReferenceFileParser.ParseStudiesAsync off IStudyParseImplementation onto the NewSpreadsheetReader / NewEndnoteReader + StudySpreadsheetRecordProcessor / StudyEndnoteRecordProcessor pair.
  • Extract shared field-name constants from SpreadsheetParseImplementation into a SpreadsheetFieldNames static class so StudySpreadsheetRecordProcessor no longer reaches back into the legacy file.
  • Delete SpreadsheetParseImplementation, EndnoteXmlParseImplementation, and the IStudyParseImplementation / ParserRegistry scaffolding.
  • Folding the shared "convert UserColumnMap rows into Screenings" logic into a single private helper (on Study or a small domain service) so it isn't re-implemented by future processors.

Pros: deletes ~280 LOC of genuinely duplicated code; makes the PubMed XML parser the odd one out (which it already is) so the abstraction has one shape, not two. Cons: non-trivial migration — the legacy parser has several behaviours (quoting, throttled IProgress<T> reporting, tracked agreement thresholds) that need careful translation; ParserRegistry has opinions about LibraryType that the new-parser path doesn't replicate today. This is not safe to do under the same merge window as #2591.

Option C — delete dead code, accept the duplication for now

  • Delete NewParser/StudySpreadsheetRecordProcessor.cs and NewParser/StudyEndnoteRecordProcessor.cs (and their tests) on the grounds that they have no production consumer and no near-term plan to acquire one.
  • Apply tolerant parsing to only SpreadsheetParseImplementation.cs.

Pros: closes the immediate bug and removes a pile of dead code in one shot. Cons: any future re-migration to the reader/processor split (Option B) has to re-derive those processors from scratch. Also conflates a bug fix with a deletion decision we haven't explicitly agreed to.

Decision for this PR

Option A, with Option B scoped as a follow-up.

Deleting the two NewParser/Study{Spreadsheet,Endnote}RecordProcessor files (Option C) is tempting given they've sat unused since their introduction, but they do encode a reasonable interface decomposition for a migration we'll probably still want to do. Keeping them — and keeping them in sync with ScreeningDecisionParser — costs a small amount of code duplication for the duration of this PR; replacing them if we did delete them would cost much more.

Test plan for Option A (this PR)

  • Unit: extend SpreadsheetParseImplementationTests with tolerant-value parsing cases (1/0, include/exclude, yes/no, true/false, case variants, invalid value produces a structured ParseError, whitespace trimmed).
  • Unit: mirror those cases in StudySpreadsheetRecordProcessorTests.
  • Regression: ensure the existing happy-path tests (integer 0/1 values) still pass end-to-end.
  • Manual verification in the PR-2592 preview: re-upload TestWithSecreening (with text decisions) and confirm it now reaches the Completed state; the resulting pmStudy documents have the correct Included / Excluded screenings attached.

Follow-ups (not in this PR)

  • Option B migration — open a separate PR once #2591 and #2592 have merged. Reference this planning doc in the PR description so the historical context isn't lost.
  • User-guide note — extend the April 2026 release notes on pr2567 (under "Bulk study update → More flexible screening values") to cover search-library uploads once this PR lands.
  • ADR — once Option B lands, replace this planning doc with a decision record under /docs/decisions/ capturing the reader/processor split.