Cron Scheduling in Precog
A cron schedule is just a pattern that tells Precog when to load your data. You can start simple, refine over time, and separate workloads across pipelines for smoother performance. Cron scheduling is available in all pipelines under the “Advanced edit” option.
Understanding Java Cron Schedules for Data Loading
A cron schedule is simply a way to tell an application like Precog when and how often to run a task automatically. In Precog, it determines when data is loaded into your data warehouse.
A cron schedule is written as a short expression made up of fields, each representing a unit of time:
- Minute → What minute of the hour to run (e.g., 0 = on the hour)
- Hour → What hour of the day (e.g., 2 = 2 AM)
- Day of Month → What day of the month (e.g., 1 = the 1st day)
- Month → Which month (e.g., 12 = December)
- Day of Week → Which day of the week (e.g., 1 = Monday)
Some formats, such as the Java/Quartz cron format, add a Seconds field at the beginning and require a ? in either the Day of Month or the Day of Week field (the two cannot both be given specific values).
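To make the difference concrete, here is the same "every day at 2:00 AM" schedule written both ways. The snippet below is purely illustrative (the class name is made up for the example). Note that, per the Quartz documentation, Quartz also numbers the Day of Week field differently (1 = Sunday through 7 = Saturday), unlike the five-field list above, where 1 = Monday.

```java
// Illustrative comparison: one schedule ("every day at 2:00 AM"), two cron formats.
public class CronFormatComparison {
    public static void main(String[] args) {
        // Plain five-field cron:  minute hour day-of-month month day-of-week
        String plainCron = "0 2 * * *";

        // Quartz cron adds a leading Seconds field and expects a "?" in either
        // Day of Month or Day of Week (whichever field you are not constraining):
        //                      sec min hour day-of-month month day-of-week
        String quartzCron = "0 0 2 * * ?";

        // Quartz numbers Day of Week 1-7 starting with Sunday (1 = SUN),
        // unlike the five-field breakdown above, where 1 = Monday.
        System.out.println("Plain cron:  " + plainCron);
        System.out.println("Quartz cron: " + quartzCron);
    }
}
```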
In Precog, cron schedules follow the UTC (Coordinated Universal Time) timezone by default. That means a schedule like 0 2 * * * (2:00 AM) runs at 2:00 AM UTC, not your local time.
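If you want to double-check what a Quartz-style expression will do before saving it, a small local sketch like the one below can help. It assumes the Quartz scheduler library (org.quartz.CronExpression) is on your classpath; the class name is made up, and this is only a verification aid, not part of how Precog itself runs schedules.

```java
import org.quartz.CronExpression;

import java.text.ParseException;
import java.util.Date;
import java.util.TimeZone;

// Local verification sketch: parse a Quartz-style expression, evaluate it in UTC,
// and print the next time it would fire.
public class NextFireTimeCheck {
    public static void main(String[] args) throws ParseException {
        // "Every day at 2:00 AM" in Quartz format (leading Seconds field, "?" for Day of Week).
        CronExpression cron = new CronExpression("0 0 2 * * ?");

        // Precog evaluates schedules in UTC by default, so evaluate in UTC here too.
        cron.setTimeZone(TimeZone.getTimeZone("UTC"));

        Date next = cron.getNextValidTimeAfter(new Date());
        // Date.toString() prints in this machine's local time zone.
        System.out.println("Next run: " + next);
    }
}
```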
Examples for Data Loading
- 0 2 * * * → Run every day at 2:00 AM (common for nightly warehouse loads)
- 0 */4 * * * → Run every 4 hours (good for keeping data fresh throughout the day)
- 30 6 * * 1 → Run every Monday at 6:30 AM (weekly refresh)
- 0 0 1 * * → Run on the 1st day of every month at midnight (monthly batch)
- 0 0,15,30,45 11-16 * * ? → Run every 15 minutes, starting at 11:00 AM and ending at 4:45 PM each day (Quartz-style example with a leading Seconds field and a ? in Day of Week)
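To confirm that the Quartz-style example really fires only on the quarter hours from 11:00 AM through 4:45 PM UTC, you can list its next few fire times with the same Quartz CronExpression class. As before, this assumes the Quartz library is available and is just a local sanity check with a made-up class name.

```java
import org.quartz.CronExpression;

import java.text.ParseException;
import java.util.Date;
import java.util.TimeZone;

// Sanity check: list the next several fire times of the Quartz-style example, in UTC.
public class ExampleScheduleCheck {
    public static void main(String[] args) throws ParseException {
        CronExpression cron = new CronExpression("0 0,15,30,45 11-16 * * ?");
        cron.setTimeZone(TimeZone.getTimeZone("UTC"));

        Date cursor = new Date();
        for (int i = 0; i < 6; i++) {
            cursor = cron.getNextValidTimeAfter(cursor);
            // Each fire time lands on :00, :15, :30, or :45 between 11:00 and 16:45 UTC.
            System.out.println(cursor);
        }
    }
}
```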
Why It Matters for Data Loads
- Performance: You can schedule loads at off-peak hours (e.g., 2 AM) so your warehouse isn’t slowed during business hours.
- Freshness: Choose how often you want new data available (hourly, nightly, weekly).
- Flexibility: Mix schedules (daily + monthly) for different data sources.
Practical Advice
A little trial and error is normal. Cron schedules—especially with the Quartz format—can take some experimenting, particularly when deciding where to place the ?. Looking at working examples often helps.
Watch for long-running datasets. If a dataset takes longer to load than the schedule allows, it can block other datasets in the same pipeline. For example, if loads are scheduled every 5 minutes but a dataset takes 8 minutes to process, the following datasets will not begin until the first one completes. This can delay when data is available for reporting.
Consider multiple pipelines. To avoid delays, you can configure more than one pipeline:
- Place your high-priority datasets in a dedicated pipeline with a tighter schedule.
- Place less critical datasets in another pipeline with a more relaxed schedule.
Having multiple pipelines that use the same configured source does not affect pricing.